- Cordoning the oldest node
- Deleting any “stuck” pods
- Draining the node
- Terminating the node
Kops will then replace the node with a new one, via AWS auto-scaling.
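For reference, the manual equivalent of each step looks roughly like this (a sketch only; the node name, pod name, namespace and instance ID below are placeholders, and the pipeline automates all of this):

```bash
# Sketch of the manual equivalent of each step (all names/IDs are placeholders)
kubectl cordon ip-172-20-123-244.eu-west-2.compute.internal              # 1. stop new pods scheduling here
kubectl delete pod stuck-pod -n some-namespace --grace-period=0 --force  # 2. remove a "stuck" pod
kubectl drain ip-172-20-123-244.eu-west-2.compute.internal \
  --ignore-daemonsets --delete-local-data                                # 3. evict the remaining pods
aws ec2 terminate-instances --instance-ids i-0123456789abcdef0           # 4. terminate; auto-scaling replaces it
```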
Guidance for the Cloud Platform team when the recycle-node pipeline job fails:
```
error: unable to drain node "ip-172-20-xxx-xxx.eu-west-2.compute.internal", aborting command...
```
The pipeline may fail if pods exist which are not part of a ReplicaSet. These are usually port-forward pods, used by developers to connect to an RDS instance from their development environment.
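To see which pods on a node are not managed by a controller, something like the following works (the node name is an example; pods with `<none>` in the OWNER column are unmanaged):

```bash
# List pods on the node with their owning controller; OWNER of <none> means unmanaged
kubectl get pods --all-namespaces \
  --field-selector spec.nodeName=ip-172-20-123-244.eu-west-2.compute.internal \
  -o custom-columns='NAMESPACE:.metadata.namespace,NAME:.metadata.name,OWNER:.metadata.ownerReferences[0].kind'
```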
The error will be similar to this:
```
There are pending nodes to be drained:
 ip-172-20-123-244.eu-west-2.compute.internal
error: pods not managed by ReplicationController, ReplicaSet, Job, DaemonSet or StatefulSet (use --force to override): port-forward
```
We run the process in a relatively “gentle” way: it just fails the pipeline, rather than force-draining the node (which would destroy these pods). However, a separate Concourse pipeline job deletes any such pods after 24 hours, so it is generally safe to ignore these pipeline failures for a day or two. Alternatively, you can manually identify and delete the pod(s) as follows:
If the pipeline logs do not show which pods are causing the issue, you can use the node name from above to run:

```bash
kubectl drain ip-172-20-123-244.eu-west-2.compute.internal --ignore-daemonsets --delete-local-data
```
This will output a message similar to the one in the pipeline logs, but it will also list the pods causing the error:
```
There are pending nodes to be drained:
 ip-172-20-123-244.eu-west-2.compute.internal
error: cannot delete Pods not managed by ReplicationController, ReplicaSet, Job, DaemonSet or StatefulSet (use --force to override): licences-dev/port-forward
```
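If you are sure the pod is no longer needed, you can delete it directly, using the namespace and pod name from the error (licences-dev/port-forward in this example):

```bash
# Delete the unmanaged pod reported by the drain error
kubectl delete pod port-forward -n licences-dev
```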
Alternatively, you can track down the namespace owner and ask them to clean up their pods by sending a message in the #ask-cloud-platform channel, something like:
> The recycle-node job failed because of some manually-created pods. If any of these pods are yours, and you're not using them anymore, please could you delete them so that the next time the recycle-node job runs, it is able to replace the node? licences-dev/port-forward
Once the offending pod has been deleted, you can re-run the pipeline or just wait for the next scheduled run.
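If you don't want to wait, the job can be re-triggered from the command line via the Concourse fly CLI (the target, pipeline and job names here are assumptions; substitute your own):

```bash
# Re-trigger the failed job (target and pipeline/job names are assumptions)
fly -t manager trigger-job -j maintenance/recycle-node
```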