It’s Just a Monitoring Change
There is somewhat of a trope in technology that suggests changes to monitoring systems are not necessarily held to the same high standard as changes to other, often more complex or revenue-generating, systems and services. The reality is that monitoring systems are complicated beasts, and should not be taken for granted; nor should we be complacent when making changes to what is still a production system. Never again will I utter the words “it’s just a monitoring change”. Never again will I fill in the Impact field of a change request with these words, because I have learned that even though it might be just a monitoring change, it can still take down your primary database and render your products unusable. Let’s rewind, and add a little context.
Part of our technology stack for backend account services (we’re talking super-backend, as in direct connections to the databases) uses an abstraction layer provided by a third-party. It’s closed source, but we have a route into them for bug fix and feature requests. One such feature request was to expose metrics to show the count of payments waiting to be fulfilled by one such utility server. As the year is 2020, this was delivered as a Prometheus query exporter application that could be scraped by our Prometheus instances. We could then graph these counts, and raise alerts when the fulfillment queue was growing beyond expectations, giving us additional visibility into problems.
query_exporter application was delivered back in July, and it has been deployed and running on the relevant servers ever since. The endpoint has been exposed (though network access is restricted) and when we manually curl the metrics endpoint we see the results we expect. The work to start scraping the endpoint was paused behind a larger piece of work to overhaul our Prometheus offering, and so was picked up again recently. Here is where there was a fundamental breakdown of communication in understanding exactly what the
query_exporter application actually does.
We assumed that the metrics endpoint would behave like others we have had experience with in the past, namely that requesting the endpoint would display the latest metrics gathered by the application. What was actually happening is that each time the endpoint was requested, it would make a request to the database which, while as efficient as it could be, still queried many millions of records. Keep this in mind as we go through the timeline of how this became a service affecting incident.
After deploying the new scrape target configuration to our test Prometheus instance, which was querying the endpoint on non-production servers, a change request was raised to begin scraping the metrics endpoint from our production Prometheus instance. The default values were used, changing only the
params values to take into account the URL of the endpoint. Most notably, the default settings for scrape interval (30s) and timeout (10s) remained. The change was carried out, and only when the config became active and Prometheus started scraping the targets in production did we realise that additional firewall rules were needed. The change was marked “partially successful” as the config was in place but the targets were not yet being scraped. Off the back of this change request, tickets to add the necessary firewall rules were raised and passed to the relevant teams to progress.
A side note about how our firewall requests tickets work. We raise a request with the requirements and other details (source, destination, ports, encryption, etc.) which gets passed to a human, then an automated process, then Security to review (not that Security aren’t human, but you know) before being implemented by an automated process. It makes firewall rules simpler to raise, implement, and manage than if the process was entirely human-driven.
A day or so later, at approximately 17:04, the firewall automation process implemented the rule that we’d asked for, and Prometheus could finally get at the URLs it was so desperate for. The more database-intensive query (getting all payment records for the previous week) took nearly as long as the interval to return results, and very soon was taking longer than the interval. Even though the Prometheus scrape was timing out after 10 seconds, the query was still running to completion on the database. After around 15 minutes we were in a situation where the database was overloaded by this query running multiple times, so much so that other services which use this database (such as login) were unable to interact with it. As for the incident process itself, this was ultimately rather mundane. Our products had customer-facing banners applied, to ensure that there would be no further organic interactions with our services and the database. Once this specific query was identified as being the apparent cause of the increase in database load, the application on the utility servers was stopped and regular service was resumed shortly afterwards.
As with all major incidents and service interruptions, a Post Incident Review was held which was open to the whole company. From this meeting, a number of excellent observations and areas for improvement were made, which I will attempt to summarise below:
- The level of testing that went into the impact of polling the metrics endpoint was not high enough. The change was applied in our non-production environments, followed by our disaster recovery environment, before making it to our customer-facing systems. Because our understanding of what the check does was incorrect, we didn’t anticipate additional load on the database, and so this was not tested for explicitly. There is not enough organic traffic in our non-production environments, and the schedule for the refreshing of the data from primary to secondary site was such that the number of records returned by the query was significantly lower than it would be in production.
- Firewall changes should not be responsible for putting features live. The change to put the check live was done one afternoon, and after noticing the additional firewall rules needed implementing, was left in a state whereby as soon as those rules were in place, the check would start working. This was intentional albeit the wrong thing to do, and again comes back to the understanding we had of exactly what the
query_exporterapplication does. This of course has a knock-on effect that we need to account for firewall requirements earlier in the process of a piece of work, and build those requirements into the scope of the work.
- Our default interval for polling endpoints should be higher, and we explicitly set it to a lower value if required. Naturally, now that we know the true nature of what this metrics query does, and its impact on the database, we no longer poll every 30 seconds. The fact remains though, that if our default polling interval had been five minutes, then the database would not have been dealing with an ever-increasing number of concurrent requests, and the impact would not have been nearly as bad.
- Given we have real time replication to secondary database clusters, the query should be running against these rather than the cluster that is serving live customer traffic.
A number of these improvements are already underway, and changes have been made to the
query_exporter to ensure that it doesn’t ever attempt to create more than two simultaneous connections to the database. As for the underlying cause of the incident (or the “root cause” if you insist on using such language), that has to be the fact that our assumptions as teams or individuals are ultimately formed by our past experiences. Given our past experiences would suggest that a metrics endpoint is relatively low-resource, we saw no problem with polling it as such a high rate. Engineers that work directly with the database, and have indeed already written their own exporters that make use of database queries, would know for certain that this should be handled with a lot more caution.
Above all else, this has been a very useful reminder that even what appears to be a simple change to monitoring might in fact be the thing that causes your revenue-generating services to grind to a halt.
EDIT (2021-02-22): The contents of this blog post have been discussed further on the Software Misadventures podcast.