Skip to main content

Incident on 2023-01-11 - Cluster image pull failure due to DockerHub password rotation

  • Key events

    • First detected: 2023-01-11 14:22
    • Incident declared: 2023-01-11 15:17
    • Repaired: 2023-01-11 15:50
    • Resolved 2023-01-11 15:51
  • Time to repair: 1h 28m

  • Time to resolve: 1h 29m

  • Identified: Identified: Cloud Platform team member observed failed DockerHub login attempts error at 2023-01-11 14:22:

failed to fetch manifest: Head "https://registry-1.docker.io/v2/ministryofjustice/cloud-platform-tools/manifests/2.1": toomanyrequests: too many failed login attempts for username or IP address
  • Impact: Concourse and EKS cluster nodes unable to pull images from DockerHub for 1h 28m. ErrImagePull error reported by one user in #ask-cloud-platform at 2023-01-11 14:54.

  • Context:

    • 2023-01-11 14:22: Cloud Platform team member observed failed DockerHub login attempts error:
failed to fetch manifest: Head "https://registry-1.docker.io/v2/ministryofjustice/cloud-platform-tools/manifests/2.1": toomanyrequests: too many failed login attempts for username or IP address
  • 2023-01-11 14:34: Discovered that cluster DockerHub passwords do match the value stored in LastPass.
  • 2022-01-11 14:40 Concourse DockerHub password updated in cloud-platform-infrastructure terraform.tfvars repository.
  • 2023-01-11 14:51 Explanation revealed. DockerHub password was changed as part of LastPass remediation activities.
  • 2023-01-11 14:52 KuberhealtyDaemensetCheck reveals cluster is also unable to pull images https://mojdt.slack.com/archives/C8QR5FQRX/p1673448593904699
With error:
Check execution error: kuberhealthy/daemonset: error when waiting for pod to start: ErrImagePull
  • 2023-01-11 14:53 dockerconfig node update requirement identified
  • 2023-01-11 14:54 user reports ErrImagePull when creating port-forward pods affecting at least two namespaces.
  • 2023-01-11 14:56 EKS cluster DockerHub password updated in cloud-platform-infrastructure
  • 2023-01-11 15:01 Concourse plan of password update reveals launch-template will be updated, suggesting node recycle.
  • 2023-01-11 15:02 Decision made to update password in live-2 cluster to determine whether a node recycle will be required
  • 2023-01-11 15:11 Comms distributed in #cloud-platform-update and #ask-cloud-platform.
  • 2023-01-11 15:17 Incident is declared.
  • 2023-01-11 15:17 J Birchall assumes incident lead and scribe roles.
  • 2023-01-11 15:19 War room started
  • 2023-01-11 15:28 Confirmation that password update will force node recycles across live & manager clusters.
  • 2023-01-11 15:36 Decision made to restore previous DockerHub password, to allow the team to manage a clean rotation OOH.
  • 2023-01-11 15:40 DockerHub password changed back to previous value.
  • 2023-01-11 15:46 Check-in with reporting user that pod is now deploying - answer is yes.
  • 2023-01-11 15:50 Cluster image pulling observed to be working again.
  • 2023-01-11 15:51 Incident is resolved
  • 2023-01-11 15:51 Noted that live-2 is now set with invalid dockerconfig; no impact on users.
  • 2023-01-11 16:50 comms distributed in #cloud-platform-update.

    • Resolution: DockerHub password was restored back to value used by EKS cluster nodes & Concourse to allow an update and graceful recycle of nodes OOH.
    • Review actions: As part of remediation, we have switched from Dockerhub username and password to Dockerhub token specifically created for Cloud Platform. (Done)