
Recycle-node

The recycle-node pipeline runs every two hours between midnight and 6am. It executes the recycle-node.rb script to replace the oldest worker node by:

  • Cordoning the oldest node
  • Deleting any “stuck” pods
  • Draining the node
  • Terminating the node

Kops will then replace the node with a new one, via AWS auto-scaling.
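
For reference, these steps are roughly equivalent to the manual commands below. This is an illustrative sketch only, not the actual contents of recycle-node.rb, and the node name, pod names and instance ID are placeholders:

 # Illustrative only - placeholders, not real resource names
 NODE=ip-172-20-xxx-xxx.eu-west-2.compute.internal

 # 1. Cordon the oldest node so nothing new is scheduled onto it
 kubectl cordon "$NODE"

 # 2. Delete any "stuck" pods (for example, pods stuck in Terminating)
 kubectl delete pod <stuck-pod> --namespace <namespace> --grace-period=0 --force

 # 3. Drain the node, evicting the remaining pods
 kubectl drain "$NODE" --ignore-daemonsets --delete-local-data

 # 4. Terminate the underlying EC2 instance; the kops-managed
 #    auto-scaling group then brings up a replacement node
 aws ec2 terminate-instances --instance-ids <instance-id>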

Guidance for the Cloud Platform team when the recycle-node pipeline job fails:

  • error: unable to drain node "ip-172-20-xxx-xxx.eu-west-2.compute.internal", aborting command...

The pipeline may fail if there are pods on the node which are not managed by a controller such as a ReplicaSet. These are usually port-forward pods, used by developers to connect to an RDS instance from their development environment.

The error will be similar to this:

There are pending nodes to be drained:
 ip-172-20-123-244.eu-west-2.compute.internal
error: pods not managed by ReplicationController, ReplicaSet, Job, DaemonSet or StatefulSet (use --force to override): port-forward
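
Pods created manually (rather than by a controller) have no ownerReferences, so one way to list them on the affected node directly is a query along the following lines. This is a sketch: it assumes you have jq installed, and it uses the example node name from the error above:

 kubectl get pods --all-namespaces \
   --field-selector spec.nodeName=ip-172-20-123-244.eu-west-2.compute.internal \
   --output json \
   | jq -r '.items[] | select(.metadata.ownerReferences == null) | .metadata.namespace + "/" + .metadata.name'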

We’re running the process in a relatively “gentle” way, so it just fails the pipeline rather than force-draining the node (which would destroy these pods). However, there is a separate Concourse pipeline job which deletes any such pods after 24 hours, so it is generally safe to ignore these pipeline failures for a day or two. Alternatively, you can manually delete the pod(s) as follows:

If the pipeline logs do not show which pods are causing the issue, you can use the node name from above to run:

 kubectl drain ip-172-20-123-244.eu-west-2.compute.internal --ignore-daemonsets --delete-local-data

It will show a similar message to the one in the pipeline logs, but will also list the pods which are causing the error.

There are pending nodes to be drained:
 ip-172-20-123-244.eu-west-2.compute.internal
error: cannot delete Pods not managed by ReplicationController, ReplicaSet, Job, DaemonSet or StatefulSet (use --force to override): licences-dev/port-forward
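
If you are confident a pod is no longer needed, you can delete it yourself. Using the example pod from the error above (licences-dev/port-forward), that would be:

 kubectl --namespace licences-dev delete pod port-forward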

Alternatively, you can track down the namespace owner and ask them to clean up their pods by sending a message in the #ask-cloud-platform channel, something like:

The recycle-node job failed because of some manually-created pods.

If any of these pods are yours, and you're not using them anymore, please could you delete them so that the next time the recycle-node job runs, it is able to replace the node?

licences-dev/port-forward

Once the pod causing the issue gets deleted, you can re-run the pipeline or just wait for the next scheduled run.
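
To re-run it from the command line, a fly trigger-job invocation along these lines should work; the Concourse target and the exact pipeline/job path here are assumptions, so check them against the real pipeline first:

 # Target and job path are assumed - verify against the Concourse UI
 fly -t <target> trigger-job -j recycle-node/<job-name>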
