Today at 2:15pm Pacific we experienced our second outage in the last two weeks. Sadly, the timing is not great, but it did give us the opportunity to re-evaluate our failure points as we continue to harden our infrastructure. This outage affected all of our endpoints, from our production and failover infrastructure to the documentation hub and status page, but did not result in any data loss.
Doppler uses Cloudflare as our DNS provider, which gives us a suite of powerful features including DDoS protection, a CDN for assets, firewall rules, edge workers, and plenty of others. Cloudflare is one of the most popular and trusted DNS providers, serving nearly 20% of all internet traffic. Today they went down, taking a portion of the global internet with them.
Cloudflare recommends using their DNS proxy so you can benefit from their suite of features. As we were reminded today, using that proxy changes the default protections DNS provides, at a nonobvious cost. DNS is by its very nature decentralized, which creates a layer of resilience against any single point of failure. But that protection breaks down when you proxy at the DNS layer, because the proxy becomes a new single point of failure. Today we all paid that nonobvious cost.
Hardening Our DNS Reliability
Internally we are evaluating the best path forward for hardening our DNS reliability. This could take a couple of different forms, such as disabling proxy mode for our DNS records. That would remove our DNS layer as a single point of failure, but it comes at the cost of losing DDoS protection, exposing our unmasked records, and giving up some other behind-the-scenes magic.
Another possible option would be to add a second DNS provider (one that also supports DDoS protection) to our stack. Then, if one provider goes down, our traffic automatically fails over to the other. This would add a fair amount of complexity to our stack.
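Conceptually, the redundancy a second provider buys us mirrors what DNS resolvers already do with multiple nameservers: try each one in order until an answer comes back. Below is a minimal sketch of that failover logic; the provider names and the query callable are illustrative stand-ins, not a real DNS client:

```python
def resolve_with_failover(hostname, providers, query):
    """Ask each provider in turn; return the first successful answer.

    providers: ordered list of provider names (illustrative).
    query: callable (provider, hostname) -> list of IPs,
           raising OSError when that provider is unreachable.
    """
    errors = {}
    for provider in providers:
        try:
            return query(provider, hostname)
        except OSError as exc:
            # Provider unreachable: record the failure and try the next one
            errors[provider] = exc
    raise OSError(f"all providers failed: {errors}")
```

Collecting per-provider errors before giving up keeps the final failure message actionable, instead of only reporting the last provider tried.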
Sadly, every solution we have considered so far has tradeoffs and could carry nonobvious consequences of its own. We deeply care about finding the right answer, not the fastest one to implement. As we continue to explore and implement solutions, we expect to write about our findings and decisions on our engineering blog.
Being transparent is core to the DNA of the company, and we strive to give our customers real-time visibility during outages. We do this through our @DopplerHelp Twitter account and our status page. Because our status page's DNS is also hosted by Cloudflare, it was affected by the outage as well. To prevent this in the future, we are moving our status page's DNS to another provider under a new dedicated domain. This domain is still being configured and will be announced soon.
The Doppler CLI has a nifty command called doppler run, which downloads your secrets from our API and injects them into your application. After each successful run, we automatically create and store an encrypted snapshot of those secrets. On the off chance the CLI is unable to connect to our API, it smartly falls back to this encrypted snapshot after 5 retries.
During the outage, our doppler run users were unaffected, as they had an existing snapshot to fall back to. One area we found that could use a little love is surfacing retry events: if a request hangs, it creates a visible delay for the user. In the next release, the Doppler CLI will print a message stating a retry event is happening so you always stay informed.
Providing a seamless experience with near-perfect uptime is an incredibly difficult task that requires deep thought about every layer of the stack. Today we were reminded that our DNS is a single point of failure, and that even the most trusted of services, like Cloudflare, can bring us down if we don't have multiple layers of redundancy. As we continue to harden our infrastructure, we plan to share our learnings with you through our engineering blog.