A critical outage affecting Restricted API Key authentication began Sunday evening at 9:00 PM PST and led to over 1 hour of downtime. The outage did not cause any data loss, but it did render Restricted API Keys inoperable against our API endpoints for that period. This was an unacceptable amount of downtime, and it could have been prevented entirely. We take your trust very seriously, as Doppler sits on the critical path of your DevOps and productivity workflows. We have learned from this experience, fixed the root cause, and added checks to prevent this kind of outage in the future. Here is what happened:
December 9th, 2019 - 9:00 PM (PST)
Our engineering team rolls out new authentication logic for our API endpoints. The update is designed to strengthen our defense in depth by adding authentication checks around every layer of our core stack.
December 9th, 2019 - 9:40 PM (PST)
While testing the API endpoints to verify the new logic, we find that all requests utilizing a Restricted API Key are being rejected.
December 9th, 2019 - 10:00 PM (PST)
We identify the culprit bug in our codebase and start working on a patch. The bug is linked to how we handle Restricted API Keys. Doppler’s API offers three methods of authentication: Personal Keys, Restricted API Keys (now called Service Tokens), and CLI Tokens. Personal Keys and CLI Tokens are tied to a user identity, while Service Tokens are not. Our investigation finds that the new authentication logic required a user identity and did not gracefully handle the case where none is present.
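To make the failure mode concrete, here is a minimal sketch of an authentication check that treats the user identity as optional for Service Tokens. It is an illustration only, not our production code: the names TokenKind, Credential, AuthError, and authenticate are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class TokenKind(Enum):
    PERSONAL_KEY = "personal_key"
    SERVICE_TOKEN = "service_token"  # formerly "Restricted API Key"
    CLI_TOKEN = "cli_token"


@dataclass(frozen=True)
class Credential:
    """Hypothetical credential record; user_id is None for Service Tokens."""
    kind: TokenKind
    user_id: Optional[str] = None


class AuthError(Exception):
    """Raised when a credential cannot be authenticated."""


def authenticate(cred: Credential) -> str:
    """Return a label for the authenticated principal, or raise AuthError."""
    if cred.kind is TokenKind.SERVICE_TOKEN:
        # Service Tokens are not tied to a user, so no identity is required.
        return "service"
    if cred.user_id is None:
        # Personal Keys and CLI Tokens must always resolve to a user.
        raise AuthError(f"{cred.kind.value} requires a user identity")
    return f"user:{cred.user_id}"
```

The point of the sketch is that the token kind decides whether a user identity is mandatory. Requiring one unconditionally, as the faulty logic did, rejects every Service Token.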
December 9th, 2019 - 10:48 PM (PST)
The patch is released to production and the engineering team starts monitoring the fix.
December 9th, 2019 - 11:38 PM (PST)
After stress testing the patched authentication logic in production, we mark the incident resolved.
End-to-End Testing (e2e)
From unit tests to e2e testing, we strive to test every part of our stack. As it turns out, we had not added e2e tests for Service Tokens. This outage has led us to reassess our test coverage and focus on testing all remaining user flows.
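As a rough sketch of what covering every authentication method could look like, the helper below runs the same request once per method and reports which ones fail. The function name, the method labels, and the call_api callable are assumptions made for illustration. In a real e2e suite, call_api would issue an actual request against a test environment.

```python
from typing import Callable, Dict, List

# The three authentication methods described in this post.
AUTH_METHODS = ("personal_key", "service_token", "cli_token")


def failing_auth_methods(
    call_api: Callable[[str], int],
    tokens: Dict[str, str],
) -> List[str]:
    """Return the auth methods whose token is missing or not accepted.

    call_api takes a token and returns the HTTP status code of the request.
    """
    failures = []
    for method in AUTH_METHODS:
        token = tokens.get(method)
        # A method without a test token counts as a failure, so that a gap
        # in coverage is reported instead of silently skipped.
        if token is None or call_api(token) != 200:
            failures.append(method)
    return failures
```

Treating a missing token as a failure is deliberate: it surfaces exactly the kind of coverage gap that let this bug reach production.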
Doppler uses Pingdom to test for uptime and displays those results on our status page. Pingdom was testing our health check endpoints, which do not require authentication. As a result, Pingdom did not report the outage. We have now changed Pingdom to test our secrets endpoint, which requires multiple layers of authentication and is a better indicator of whether our API is serviceable to customers.
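The difference between the two kinds of probe can be sketched as follows. This is not Pingdom configuration. The send transport, the URLs, and the Bearer header scheme are all assumptions made for the example.

```python
from typing import Callable, Mapping

# Hypothetical transport: takes (url, headers), returns an HTTP status code.
Transport = Callable[[str, Mapping[str, str]], int]


def health_check_passes(send: Transport, health_url: str) -> bool:
    """Unauthenticated probe: only shows that the server responds."""
    return send(health_url, {}) == 200


def api_is_serviceable(send: Transport, secrets_url: str, token: str) -> bool:
    """Authenticated probe: exercises the same auth path customers use."""
    # The Bearer scheme is an assumption for this sketch.
    headers = {"Authorization": f"Bearer {token}"}
    return send(secrets_url, headers) == 200
```

During an outage like this one, the first probe keeps passing while the second fails, which is why the authenticated probe is the more honest uptime signal.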
Announcing Maintenance Windows
Starting today, we are going to announce major maintenance windows one week in advance. Our goal is for maintenance to never cause downtime, but if downtime does strike, we want you to be prepared. Knowing the window in advance helps your engineering teams plan for the off chance that something goes wrong.