Our first critical outage occurred today, leading to almost 2 hours of downtime but thankfully no data loss. This amount of downtime is unacceptable and could have been completely prevented. We take your trust very seriously, as Doppler is a critical path in your devops and productivity workflows. We have learned a great deal from this experience, both by fixing the root cause and by adding checks to prevent this kind of outage in the future. Here is what happened:
March 6th, 2019 - 3:04 PM (PST)
Heroku starts an automated maintenance on our primary Postgres database. This process includes creating a follower of our primary database on the newest Postgres version, hard forking it, and setting the fork as the new primary database. Once the new primary is in use, the old database is removed. We were warned about this migration ahead of time and assumed the maintenance/migration would be quick and that all of the credentials would automatically be updated across our environments.
March 6th, 2019 - 3:20 PM (PST)
Our servers start crashing and we receive a flood of Bugsnag error reports. Digging through the stack traces and logs, we realize a failure occurred during the migration: our primary database URL was revoked, but the new primary database URL had not been automatically set as an environment variable. Our servers were attempting to connect to an invalid database URL, which caused them to crash. The same failure affected our failover servers, which relied on a follower of our primary database.
Since Doppler relies on Doppler for environment variables, this created a circular dependency: we need to be up in order to boot up. Removing Doppler as a source of environment variables is tricky, as we only stored our environment variables in Doppler. Those variables are tokenized with our security vendor, so it was impossible for us to copy them from our database into Heroku manually. Instead, we had to go through every service we use and grab or create new credentials through each of their individual dashboards.
March 6th, 2019 - 3:30 PM (PST)
We initiated our recovery plan:
1. Report the incident on StatusPage
2. Put together a list of all the credentials needed
3. Find the correct database URL
4. Update our environment variables with the new credentials
5. Deploy new code with the up-to-date environment variables
March 6th, 2019 - 3:50 PM (PST)
After finding all of the credentials and environment variables we needed, we started entering them manually into Heroku. This had to be done across all of our services for Doppler to work properly.
After setting most of our environment variables, we ran into a critical problem: Heroku would not let us set the new database URL because it is a managed variable controlled by Heroku Postgres. In addition, our servers were scaling up rapidly as they constantly crashed and rebooted, each time hitting our API, which was already down. This endless cycle created timeouts and produced a massive number of logs, making it very hard to debug. To counteract this, we tried to disable autoscaling and scale our servers down to 1 dyno. This kept returning an error, as Heroku marked it as an invalid request. After 10 minutes, we were finally able to scale our servers down to 1 dyno.
With the logging now easier to comb through, we refocused on fixing the invalid Postgres credentials. We found a buried option in a submenu of the Managed Postgres Dashboard to force a rotation of our database credentials. Heroku then immediately propagated the new credentials to our environment variables, as we expected.
March 6th, 2019 - 4:15 PM (PST)
With all of the correct environment variables now on Heroku, we started step 5. Though the Doppler API was down, our SDK was able to fall back to the environment variables set on Heroku. Shortly after deploying the fix and bringing the website back up, we realized our security vendor's API URL differed from the one on their dashboard, as they had created a dedicated one for us.
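The fallback behavior that saved us here can be sketched roughly as follows. This is a minimal illustration, not the real SDK: `fetchFromDoppler` and `getSecrets` are hypothetical names, and the fetch is hard-coded to fail to simulate the outage. When the Doppler API is unreachable, the variables already set in the host environment (`process.env` on Heroku) are used instead.

```javascript
// Hypothetical stand-in for the SDK's API call. During the outage,
// every request to the Doppler API failed, so this always throws.
async function fetchFromDoppler() {
  throw new Error('Doppler API unreachable');
}

// Try the Doppler API first; on failure, fall back to the environment
// variables already present on the host (e.g. process.env on Heroku).
async function getSecrets(fallbackEnv) {
  try {
    return await fetchFromDoppler();
  } catch (err) {
    return { ...fallbackEnv };
  }
}
```

In production the fallback source would be `process.env`; it is injected here so the behavior is easy to exercise in isolation.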
March 6th, 2019 - 4:30 PM (PST)
After digging through a hundred old Slack messages and talking with support, we found the dedicated API URL to use. Our servers were then fully back up and running.
March 6th, 2019 - 4:50 PM (PST)
After 20 minutes of additional monitoring and stress testing, we updated StatusPage to mark the incident as resolved.
Doppler being a critical path for Doppler is a linchpin waiting to be pulled. We needed a fallback option built into our systems so that Doppler can boot up after a full outage. We will accomplish this by ensuring our Heroku environment variables are always up to date.
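Keeping Heroku's config vars in sync can be done through the Heroku Platform API, which exposes an app's config vars at `PATCH /apps/{app}/config-vars`. The sketch below (with placeholder app name and token) only assembles the request, so the sync logic can be inspected without touching a live app.

```javascript
// Build a Heroku Platform API request that overwrites the given config
// vars with a fresh secrets snapshot. The Platform API requires the
// versioned Accept header and a bearer token.
function buildSyncRequest(appName, token, secrets) {
  return {
    method: 'PATCH',
    url: `https://api.heroku.com/apps/${appName}/config-vars`,
    headers: {
      Accept: 'application/vnd.heroku+json; version=3',
      Authorization: `Bearer ${token}`,
      'Content-Type': 'application/json',
    },
    body: JSON.stringify(secrets),
  };
}
```

Running this on every secrets change keeps the on-Heroku copy current, so a booting server always has a usable local fallback.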
Prior to the outage, our precheck script only checked for the presence of all required environment variables. This is not good enough. During the outage, when our database URL was revoked, the old prechecks still passed because a URL was present. Our new precheck script also initializes all libraries and clients (including our database client), verifying ahead of time that every credential successfully authorizes.
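A rough sketch of this precheck idea, with hypothetical names (`precheck`, an injected `connect` standing in for a real database client's connection call): presence alone is not enough, so the script also exercises the credential by opening a connection.

```javascript
// Precheck sketch: first verify every required variable is present
// (the old behavior), then prove the credential actually works by
// initializing the client with it (the new behavior). A revoked
// DATABASE_URL passes the presence check but fails the connection.
async function precheck(env, requiredVars, connect) {
  for (const name of requiredVars) {
    if (!env[name]) throw new Error(`Missing environment variable: ${name}`);
  }
  // Attempt a real connection so a revoked or wrong credential fails
  // here, before deploy, instead of crashing the servers at boot.
  await connect(env.DATABASE_URL);
}
```

Because `connect` is injected, the same precheck can be pointed at any client whose credentials need to be proven, not just the database.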
This was a wake-up call that we will not be able to run on Heroku for much longer. Heroku provides a lot of value for the price tag, but it comes at a deep expense: a lack of devops control. Over time we will migrate to AWS and build out our own devops workflows that can support our extremely high SLA and fault-tolerance requirements.