From 2023-01-16 19:11 UTC to 2023-01-16 20:04 UTC, Doppler experienced a partial outage that prevented sync integrations, webhooks, and activity log notifications from executing. The outage also prevented an internal job that recomputes the version hashes for Doppler configs from firing. As a result, API clients (e.g. the Doppler CLI and Kubernetes Operator) did not receive secret updates made during this window.
A recovery migration was run at 2023-01-17 00:37 UTC. It re-triggered all syncs, webhooks, and activity log notifications, and recomputed config version hashes so that all clients could again fetch secret updates.
Doppler uses RabbitMQ to queue jobs that need to be executed as a result of secret updates. On 2023-01-13, Doppler’s security team rotated a RabbitMQ password, having mistakenly identified the credential as unused in production. It took several days for the RabbitMQ sessions in Doppler’s production services to expire, and once they did, queue jobs could no longer be published.
Once the incident was identified, Doppler’s security team created new RabbitMQ users to be used by our production services. The change was deployed and the incident was resolved at 2023-01-16 20:04 UTC.
At 2023-01-17 00:37 UTC, Doppler ran a recovery migration to re-fire queue events for sync integrations, webhooks, activity log notifications, and secret version hash recomputations that were meant to fire during the incident window.
Doppler has switched from using a single RabbitMQ credential to using one user per service. RabbitMQ users are now clearly named to mitigate the risk of accidental rotation in the future.
We’ve also identified that API clients’ ability to fetch secrets should not depend on our application’s ability to connect to RabbitMQ. Our engineering team will move the config version hash computation into our atomic secrets write operation, so that clients always fetch the latest secrets.
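As an illustration only, here is a minimal sketch of what computing the version hash inside the same atomic write could look like. It is not Doppler's implementation: the SQLite store, the `configs` table, the SHA-256 hashing scheme, and the function names are all assumptions made for the example.

```python
import hashlib
import json
import sqlite3


def compute_version_hash(secrets: dict) -> str:
    """Hash a canonical JSON encoding of the secrets (hypothetical scheme)."""
    canonical = json.dumps(secrets, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def init_db(conn: sqlite3.Connection) -> None:
    # Hypothetical schema: one row per config, with the hash stored beside the data.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS configs ("
        "name TEXT PRIMARY KEY, "
        "secrets_json TEXT NOT NULL, "
        "version_hash TEXT NOT NULL)"
    )


def write_secrets(conn: sqlite3.Connection, config: str, secrets: dict) -> str:
    """Store the secrets and their version hash in a single transaction."""
    version_hash = compute_version_hash(secrets)
    # The connection context manager commits on success and rolls back on error,
    # so a reader can never observe new secrets paired with a stale hash.
    with conn:
        conn.execute(
            "INSERT OR REPLACE INTO configs (name, secrets_json, version_hash) "
            "VALUES (?, ?, ?)",
            (config, json.dumps(secrets, sort_keys=True), version_hash),
        )
    return version_hash
```

Because the hash is written in the same transaction as the secrets, no queue or background job has to run before clients can detect that a config changed.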
Lastly, our engineering team is reconfiguring the way we queue asynchronous jobs to ensure that if secrets are modified during a partial infrastructure failure, all post-update jobs will eventually be executed without the need for manual recovery migrations.
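One common way to get this kind of eventual execution is a transactional outbox: pending jobs are recorded durably alongside the secrets write, and a separate relay publishes them to the broker whenever it is reachable. The sketch below illustrates that general pattern under stated assumptions. It is not a description of Doppler's design, and the `outbox` table, the job type names, and the `publish` callback are hypothetical.

```python
import sqlite3
from typing import Callable

# Hypothetical job types that should follow every secrets update.
JOB_TYPES = ("sync", "webhook", "activity_log")


def init_outbox(conn: sqlite3.Connection) -> None:
    conn.execute(
        "CREATE TABLE IF NOT EXISTS outbox ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "job_type TEXT NOT NULL, "
        "config TEXT NOT NULL, "
        "published INTEGER NOT NULL DEFAULT 0)"
    )


def record_update(conn: sqlite3.Connection, config: str) -> None:
    """Record post-update jobs durably; no broker connection is needed here."""
    with conn:
        # In a real system the secrets write would share this transaction.
        conn.executemany(
            "INSERT INTO outbox (job_type, config) VALUES (?, ?)",
            [(job_type, config) for job_type in JOB_TYPES],
        )


def relay_pending(conn: sqlite3.Connection, publish: Callable[[str, str], None]) -> int:
    """Publish unpublished jobs in order; return how many were sent."""
    rows = conn.execute(
        "SELECT id, job_type, config FROM outbox WHERE published = 0 ORDER BY id"
    ).fetchall()
    sent = 0
    for job_id, job_type, config in rows:
        try:
            publish(job_type, config)
        except Exception:
            # Broker unavailable: leave this and later rows pending for the next run.
            break
        with conn:
            conn.execute("UPDATE outbox SET published = 1 WHERE id = ?", (job_id,))
        sent += 1
    return sent
```

Running `relay_pending` periodically means jobs recorded during a broker outage are published once the broker recovers. A job can be published twice if the process fails between publishing and marking the row, so delivery is at least once and consumers should be idempotent.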