The 500 test

April 13, 2018

I’ve been thinking about software resilience lately, and especially where systems that lack resiliency went wrong. It’s a problem that lands squarely in the bucket of “things that are tractable early on, but nearly impossible to fix later”, which in software, is a big one.

In a typical online service, most of us expect upwards of 99.9% of the calls it processes to be fulfilled successfully. However, there will always be some number off in the tail lost to an accidental bug, a bad deploy, an external service that fails to respond, or a database failure or timeout. These are often surfaced to users as an HTTP 5xx status code.

It’s those edges where engineering time is lost. For services where every call is critical and must resolve, but which don’t bake in strong rules around consistency, state must be laboriously repaired by hand – at incredible expense.

We should endeavor to not be losing that time. To that end, I propose the 500 test as a guiding operating principle for building online software:

In the event of any internal fault (500), your service should be left in a state that’s consistent, or one that will be eventually consistent, as long as the latter will occur in a fully automated way.

The consistency I’m talking about here is the same one as in ACID. All state, be it in the local service or in foreign ones should be valid, even if the 500 interrupted a request midway between object mutations.

Readonly requests like GETs usually pass the 500 test automatically – no state is modified, so it’s just as consistent after the failed request as before. Mutations are more difficult. To ensure safety, services need to be built with strong transactional consistency. Where in-band invocations of foreign services are necessary, more sophisticated (and more complex) techniques like transactional state machines are appropriate.

Why the emphasis on “automated”? The human time spent on operation and recovery is expensive, and a top factor leeching engineering productivity, second only to widespread intrinsic technical debt. Building systems that are safe at rest (or at least dramatically safer) is the ultimate way of producing engineering leverage to make new things.

Think about the edges, and bake in resiliency from day one. Avoid technologies that make this difficult-to-impossible (e.g., transaction-less data stores). Failure shouldn’t be routine, but it should be handled routinely.

The 500 test was published on April 13, 2018.

Find me on Twitter at @brandur.

Please post comments and discussion to Hacker News.

Did I make a mistake? Please consider sending a pull request.