A lot of this information is out of date, and not to be relied upon. However, it’s probably not worth bringing it up to date, since the last Template Deploy service is (at the time of writing) in the process of migrating to the Cloud Platform. I’m leaving this document as-is for now, but it will be removed upon the migration of the last service.
(to be decommissioned as and when template-deploy support ceases)
Various procedures for supporting our production template-deploy services. If something goes wrong and you are on call, you should be able to find the relevant information via the Confluence links below.
General Incident Procedure
If there is a problem with one of these projects, the general procedure is:
- Look at the problem and determine if the outage is a short blip or likely to last.
  - If it’s just a blip then most likely nothing needs to be done.
- Put the holding page up
  - Each project should have a ‘sorry, the service is currently unavailable’ holding page served from alternative infrastructure for this case. Check the project’s GitHub repository for instructions on how to enable it for that project.
- Email and notify people
  - Keep people informed about the problem: let them know what sort of problem it is (e.g. hosting supplier problem, VM problem, app problem) and whether the holding page is up or not.
  - Each project should have an incident email list with everyone who needs to know about service outages. Check the project’s GitHub repository for that address.
  - Generally we should keep the project owner and delivery manager (if we have these details) informed about how long we think it will take to fix the problem, and whether it is something we can fix ourselves or we need to log a call with a third party. Check the relevant run-book and/or the project’s GitHub pages for project owner/delivery manager details. Additional information can also be found in Products Hosted by Cloud Platform, Other Services, and Contact Service Managers (template-deploy).
  - Notify the relevant team via their Slack channel so that the team knows that something is wrong, and also the latest status of the problem.
- Diagnose the problem:
  - Go to the alert URL. This may be found in different places depending on the alerting service and the nature of the alert. You might find it on the dashboard, in the PagerDuty or Slack #high-priority-alarms messages, or elsewhere. You may not find it at all.
    NOTE: You may not get a correct URL. At the time of writing, preprod PVB only worked if called on /prisoner, but the alert shows the root URL.
  - Resolve the alert on PagerDuty. You can do this by following the instructions in the text message on the Incident Response phone, or directly in PagerDuty. If you don’t, it will continue to re-alert.
  - Go to Pingdom > Monitoring and filter on the Down status. Apply any other filters that help you find the issue.
  - Click on the issue to open the full report.
  - Go to the URL section, copy the URL and open it in a browser to check it.
  - If the URL works as expected, click Test Check to retest and reset Pingdom.
  - If the URL does not work, does not work as expected, or Pingdom will not reset, investigate and fix.
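When checking the URL from the Pingdom report, a quick command-line check can stand in for the browser. This is a sketch only: `check_url` is a hypothetical helper, and the URL in the usage comment is a placeholder rather than a real service address.

```shell
# check_url prints the HTTP status code for a URL -- a rough stand-in
# for Pingdom's uptime check. "000" usually means the connection failed
# entirely (DNS, TLS or network problem rather than an HTTP error).
check_url() {
  curl -s -o /dev/null -w '%{http_code}' --max-time 10 "$1"
}

# Usage (placeholder URL -- take the real one from the Pingdom report):
#   check_url https://some-service.example/healthcheck
```

A 200 suggests the service is responding and Pingdom can be reset with Test Check; anything else (including 000) points to a genuine problem worth investigating.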
- Fix the problem and test that the service is healthy again. Depending on the service, you may need to update your /etc/hosts file to gain access to the site (e.g. where you do not have access to the internal DNS, but the service is delivered via an Apache/Nginx vhost).
- Take down the holding page
- Send a ‘resolved’ email and notify people
  - Send a final email to the project incident list to let everyone know that the service is back up.
  - Notify the relevant team via their Slack channel so that the team knows that the problem is resolved and the service is back up.
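The /etc/hosts workaround mentioned in the ‘Fix the problem’ step can be sketched as below. The hostname and IP are placeholders; take the real values from the project’s run-book. The `resolve_curl` helper is a hypothetical alternative that pins DNS resolution for a single request without editing /etc/hosts.

```shell
# Option 1: add a temporary entry to /etc/hosts (needs sudo; remove it
# again once you no longer need it). Hostname and IP are placeholders:
#   echo '10.0.0.12  some-service.internal.example' | sudo tee -a /etc/hosts

# Option 2: pin the hostname-to-IP mapping for a single curl request
# instead of editing /etc/hosts.
# usage: resolve_curl <host> <port> <ip> <url>
resolve_curl() {
  curl -s --resolve "$1:$2:$3" "$4"
}

# e.g. resolve_curl some-service.internal.example 443 10.0.0.12 \
#        https://some-service.internal.example/healthcheck
```

Option 2 is handy for a one-off health check; Option 1 is better if you need to click around the site in a browser.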