Last week, on Monday the 4th of November, 2019, we had a seven-hour outage. At 2.14pm UTC we deployed a new release to Glitch. Normally, we do this many times a day without incident. In this instance, the code we released broke our API servers, which resulted in a sitewide outage. During this time you were unable to access the Glitch site (which is itself a Glitch application) or your own Glitch projects.
As part of our normal process, we reverted the change and re-deployed. This, however, did not result in the API servers recovering. After several hours of diagnosis, we identified a combination of issues, principally a crash loop in our API servers and very aggressive retry logic in our API proxies, that prevented the API servers from recovering. We deployed a new fleet of API proxies to address the issue, the API servers returned to normal, and the site began serving traffic again at 9.00pm UTC.
We’re deeply sorry this occurred and are grateful for the patience and support shown by the community. To help understand this outage and what we’ve learned from it, we want to share with you what happened and why it took so long to resolve.
(All times in UTC)
At 2.14pm (UTC) a release was deployed to production. The triggering release was a substantial redesign of our logging module, called ng-logger. This is used by all of our backend services to aggregate and send telemetry to a number of different services. We believe the accidental introduction of a “deep clone” led to a depletion of memory, leaving our API servers unable to process requests.
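To illustrate why a deep clone in a hot logging path can be costly, here is a minimal TypeScript sketch. It is not the ng-logger code: the `LogContext` shape and both helper functions are hypothetical, chosen only to show the difference in allocation behaviour.

```typescript
// Hypothetical log context; not the actual ng-logger data model.
interface LogContext {
  requestId: string;
  payload: { items: number[] };
}

// Deep clone: every nested object and array is re-allocated on each call,
// so memory use grows with the size of the context and the call rate.
function deepClone<T>(value: T): T {
  return JSON.parse(JSON.stringify(value)) as T;
}

// Shallow copy: only the top-level object is new; nested data is shared.
function shallowCopy<T extends object>(value: T): T {
  return { ...value };
}

const logContext: LogContext = { requestId: "abc", payload: { items: [1, 2, 3] } };
const deep = deepClone(logContext);
const shallow = shallowCopy(logContext);

// The deep clone owns a separate copy of the nested payload.
console.log(deep.payload !== logContext.payload);
// The shallow copy still points at the original payload.
console.log(shallow.payload === logContext.payload);
```

If a logger performs the deep variant on every log call with a large context, each call allocates a full copy that the garbage collector must later reclaim, which is one plausible way memory pressure can build up under load.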
The release led to an immediate drop in requests being served by our API proxies. Automated alerting via Pingdom and DataDog triggered a page at approximately 2.15pm, 1 minute into the deploy. The page triggers were Pingdom being unable to access the Glitch site, the API and a number of other key sites. After we detected the drop in requests, we began tracking the incident. Our Infrastructure team began to diagnose the problem and our Support team notified users of an issue.
We identified the immediate culprit as the recent deploy. We reverted the deploy and this new deployment was completed by 2.41pm. Unfortunately, rather than resolving the issue as we had hoped, the API servers remained locked.
We then began to diagnose the issue further. Our diagnosis was hindered by some gaps in our telemetry. We have begun to instrument all of our services with distributed traces but the process is not yet complete. This means we have limits in what we can observe outside of our API and API proxy servers.
During our diagnosis we identified several problems that we believe either contributed to the outage or exacerbated it. These included:
- We have very aggressive retry logic in the API proxies, and some related systems, resulting in repeated attempts to contact the API servers. This kept the API servers overloaded and locked up.
- A possible crash loop in the API service due to how we handle Sequelize timeouts.
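One common way to make retries less aggressive is capped exponential backoff with jitter, so that clients spread their retries out instead of hammering a struggling server in lockstep. The sketch below is a generic illustration of that technique, not the retry logic in the Glitch API proxies; the function name and the parameter values are assumptions.

```typescript
// Hypothetical helper: delay in milliseconds before retry number `retry` (1-based).
// Uses capped exponential backoff with "full jitter": the delay is drawn
// uniformly from [0, ceiling), where the ceiling doubles with each retry
// until it reaches maxDelayMs.
function backoffDelayMs(
  retry: number,
  baseDelayMs: number,
  maxDelayMs: number,
  random: () => number = Math.random // injectable so the schedule can be tested
): number {
  const ceiling = Math.min(maxDelayMs, baseDelayMs * Math.pow(2, retry - 1));
  return Math.floor(random() * ceiling);
}

// With a fixed "random" value of 0.5 the schedule is deterministic:
const delays = [1, 2, 3, 4, 5].map((n) => backoffDelayMs(n, 100, 1000, () => 0.5));
console.log(delays); // [50, 100, 200, 400, 500]
```

A caller would wait `backoffDelayMs(n, ...)` before its n-th retry and give up after a fixed number of attempts. The cap keeps worst-case latency bounded, and the jitter keeps many clients from retrying at the same instant.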
Once we’d identified these issues, we determined that we needed some way of diverting traffic away from our API proxies, so that we could spin up a new fleet that wouldn’t lock up immediately under the flood of traffic. We made some changes to our load balancing configuration to allow us to create new fleets of API proxies that would start without receiving traffic.
Once these fleets were created, we began to send traffic to them and the API servers began to recover. This recovery began at 8.45pm and full site recovery was achieved at 9.00pm.
We learned a lot during this process. We conducted a blameless retrospective with the involvement of most of the product, design, and engineering teams, as well as a few other folks from across the company. It was our first “at scale” retrospective and we learned a lot about how we will run retrospectives in the future.
We also learned a lot about our incident management process. We’ve grown dramatically in the last year and some people on the team weren’t familiar with the process. Their first exposure to it was during this incident. We’ll be updating and familiarising everyone with the process going forward.
Our existing incident process also didn’t call for a “scribe” role to record the timeline of events concurrently with the incident, meaning we had to reconstruct events and data after the incident. We’ve updated our process to clearly define this role and ensure a timeline is initiated when the incident is declared.
We’re pleased that our monitoring identified the issue in a very short period of time and that our (only partially implemented) observability framework through Honeycomb allowed us to rapidly test and resolve many theories. It’s a priority for us to complete the deployment of this observability framework to additional parts of the platform to remove the current gaps in coverage. This lack of complete coverage cost us some time as we followed false leads, and led to some frustration as we occasionally ran out of ideas about what to investigate.
We’ve identified that we need to work on the retry logic and the timeouts of the API servers and proxies. We’re also going to be looking at our deploy process, which could be more resilient and better designed, especially to allow us to deploy in a more staged manner. This might have allowed us to mitigate the issue before it hit the whole fleet.
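As a rough sketch of what a staged deploy can look like, the TypeScript below splits a fleet into progressively larger batches and stops at the first batch that fails a health check. It is a generic illustration under assumed names (`planStages`, `rollOut`, the host names and the stage fractions are all hypothetical), not a description of the Glitch deploy tooling.

```typescript
// Split a fleet into rollout stages using cumulative fractions,
// e.g. [0.1, 0.5, 1] means 10% of hosts, then up to 50%, then the rest.
function planStages(hosts: string[], fractions: number[]): string[][] {
  const stages: string[][] = [];
  let done = 0;
  for (const fraction of fractions) {
    // Always advance by at least one host so a small fleet still gets a canary.
    const target = Math.min(
      hosts.length,
      Math.max(done + 1, Math.ceil(hosts.length * fraction))
    );
    if (target > done) {
      stages.push(hosts.slice(done, target));
      done = target;
    }
  }
  return stages;
}

// Deploy stage by stage; halt at the first stage whose health check fails,
// leaving the remaining hosts on the previous release.
function rollOut(
  stages: string[][],
  deployAndCheck: (batch: string[]) => boolean
): { deployed: string[]; halted: boolean } {
  const deployed: string[] = [];
  for (const stage of stages) {
    const healthy = deployAndCheck(stage);
    deployed.push(...stage);
    if (!healthy) {
      return { deployed, halted: true };
    }
  }
  return { deployed, halted: false };
}

const fleet = ["h0", "h1", "h2", "h3", "h4", "h5", "h6", "h7", "h8", "h9"];
const stages = planStages(fleet, [0.1, 0.5, 1]);
// A release that fails its health check only ever reaches the first stage.
const badRelease = rollOut(stages, () => false);
console.log(stages.map((s) => s.length), badRelease);
```

With ten hosts and these fractions, a bad release reaches a single canary host before the rollout halts, rather than the whole fleet.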
This outage was upsetting and stressful for our users and for the team. But we learned a great deal and are correcting the issues that allowed the outage to occur and going deeper on our incident management, deploy and other processes to ensure future incidents are handled more efficiently. Thank you for your patience during the outage and we'll see you on Glitch.com!