brandur.org

Implementing Stripe-like Idempotency Keys in Postgres

In APIs, idempotency is a powerful concept. An idempotent endpoint is one that can be called any number of times while guaranteeing that the side effects will occur only once. In a messy world where clients and servers may occasionally crash or have their connections drop partway through a request, it’s a huge help in making systems more robust to failure. Clients that are uncertain whether a request succeeded or failed can simply keep retrying it until they get a definitive response.

As we’re about to see in this article, implementing a server so that all requests to it are perfectly idempotent isn’t always easy. For endpoints that get away with only mutating local state in an ACID database, it’s possible to get a robust and simple idempotency implementation by mapping requests to transactions, which I wrote about in more detail a few weeks ago. That approach is far simpler than what’s described here, and I’d suggest that anyone who can get away with it take that path.

Implementations that need to make synchronous changes in foreign state (i.e. outside of a local ACID store) are somewhat more difficult to design. A basic example of this is if an app needs to make a Stripe request to create a charge and needs to know in-band whether it went through so that it can decide whether to proffer some good or service. To guarantee idempotency on this type of endpoint we’ll need to introduce idempotency keys.

An idempotency key is a unique value that’s generated by a client and sent to an API along with a request. The server stores the key to use for bookkeeping the status of that request on its end. If a request should fail partway through, the client retries with the same idempotency key value, and the server uses it to look up the request’s state and continue from where it left off. The name “idempotency key” comes from Stripe’s API.

A common way to transmit an idempotency key is through an HTTP header:

POST /v1/charges

...
Idempotency-Key: 0ccb7813-e63d-4377-93c5-476cb93038f3
...

amount=1000&currency=usd
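On the client side, the important detail is that the key is generated once per logical operation and reused on every retry. Here's a minimal Ruby sketch of that idea. The `send_request` lambda is a hypothetical stand-in for a real HTTP call, and treating `Timeout::Error` as the only indeterminate failure is an assumption made to keep the example short.

```ruby
require "securerandom"
require "timeout"

# Retries a request with one idempotency key until it gets a definitive
# answer. `send_request` is a hypothetical stand-in for a real HTTP call: it
# receives the headers to send and either returns a response or raises
# Timeout::Error when the outcome is unknown.
def with_idempotent_retries(send_request, max_attempts: 5)
  # Generated once, outside the retry loop, so every attempt shares it.
  headers = { "Idempotency-Key" => SecureRandom.uuid }

  attempts = 0
  begin
    attempts += 1
    send_request.call(headers)
  rescue Timeout::Error
    retry if attempts < max_attempts
    raise
  end
end

# Simulated flaky transport: times out twice, then succeeds.
seen_keys = []
flaky = lambda do |headers|
  seen_keys << headers["Idempotency-Key"]
  raise Timeout::Error, "connection dropped" if seen_keys.length < 3

  { status: 201 }
end

response = with_idempotent_retries(flaky)
```

A real client would likely also wait between attempts (with some backoff) rather than retrying immediately.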

Once the server knows that a request has definitively finished by either succeeding or failing in a way that’s not recoverable, it stores the request’s results and associates them with the idempotency key. If a client makes another request with the same key, the server simply short circuits and returns the stored results.
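To make the short circuit concrete, here's a small in-memory sketch in Ruby. `ReplayCache` is a hypothetical name used only for illustration, and it ignores locking and in-progress requests, which the Postgres-backed implementation later in the article has to deal with.

```ruby
# In-memory stand-in for the server's key storage (hypothetical; the real
# implementation in this article is built on Postgres).
class ReplayCache
  def initialize
    @results = {}
  end

  # Runs the block only the first time a key is seen. Later calls with the
  # same key return the stored result without running the block again. If
  # the block raises, nothing is stored, so a retry runs it again.
  def fetch(idempotency_key)
    return @results[idempotency_key] if @results.key?(idempotency_key)

    @results[idempotency_key] = yield
  end
end

charges = 0
create_charge = lambda do
  charges += 1
  { status: 201, charge_number: charges }
end

cache = ReplayCache.new
key = "0ccb7813-e63d-4377-93c5-476cb93038f3"

first  = cache.fetch(key, &create_charge)
second = cache.fetch(key, &create_charge) # replayed, no second charge
```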

Keys are not meant to be used as a permanent request archive but rather as a mechanism for ensuring near-term correctness. Servers should recycle them out of the system beyond a horizon where they won’t be of much use – say 24 hours or so.

Let’s look at how to design idempotency keys for an API by building a reference implementation.

Our great dev relations team at Stripe has built an app called Rocket Rides to demonstrate the use of the Connect platform and other interesting parts of the API. In Rocket Rides, users who are in a hurry share a ride with a jetpack-certified pilot to get where they’re going fast. SOMA’s gridlock traffic disappears into the distance as they soar free through virgin skies. Travel can be a little more risky than Lyft, so make sure to pack an extra parachute.

Rocket Rides the app.

The Rocket Rides repository comes with a simple server implementation, but software tends to grow with time, so to be more representative of what a real service with 15 engineers and half a dozen product owners would look like, we’re going to complicate things with a few embellishments.

When a new ride comes in we’ll perform this set of operations:

  1. Insert an idempotency key record.
  2. Create a ride record to track the ride that’s about to happen.
  3. Create an audit record referencing the ride.
  4. Make an API call to Stripe to charge the user for the ride (here we’re leaving our own stack, and this presents some risk).
  5. Update the ride record with the created charge ID.
  6. Send the user a receipt via email.
  7. Update idempotency key with results.
A typical API request to our embellished Rocket Rides backend.

Our backend implementation will be called from the Rocket Rides mobile app with an idempotency key. If a request fails, the app will continue retrying the operation with the same key, and our job as backend implementers is to make sure that’s safe. We’ll be charging users’ credit cards as part of the request, and we absolutely can’t take the risk of charging them twice.

Most of the time we can expect every one of our Rocket Rides API calls to go swimmingly, and every operation will succeed without a problem. However, when we reach the scale of thousands of API calls a day, we’ll start to notice a few problems appearing here and there: requests failing due to poor cellular connectivity, API calls to Stripe failing occasionally, or bad turbulence caused by moving at supersonic speeds periodically knocking users offline. After we reach the scale of millions of API calls a day, basic probability will dictate that we’ll be seeing these sorts of things happening all the time.

Let’s look at a few examples of things that can go wrong:

  • Inserting the idempotency key or ride record could fail due to a constraint violation or a database connectivity problem.
  • Our call to Stripe could time out, leaving it unclear whether our charge went through or not.
  • Contacting Mailgun to send the receipt could fail, leaving the user with a credit card charge but no formal notification of the transaction.
  • The client could disconnect as they’re transmitting a request to the server, cancelling the operation midway through.

Now that we have a premise in place, let’s introduce some ideas that will let us elegantly solve this problem.

To shore up our backend, it’s key to identify where we’re making foreign state mutations; that is, calling out and manipulating data on another system. This might be creating a charge on Stripe, adding a DNS record, or sending an email.

Some foreign state mutations are idempotent by nature (e.g. adding a DNS record), some are not idempotent but can be made idempotent with the help of an idempotency key (e.g. charge on Stripe, sending an email), and some operations are not idempotent, most often because a foreign service hasn’t designed them that way and doesn’t provide a mechanism like an idempotency key.

The reason that the local vs. foreign distinction matters is that unlike a local set of operations where we can leverage an ACID store to roll back a result that we didn’t like, once we make our first foreign state mutation, we’re committed one way or another [1]. We’ve pushed data into a system beyond our own boundaries and we shouldn’t lose track of it.

We’re using an API call to Stripe as a common example, but remember that even foreign calls within your own infrastructure count! It’s tempting to treat emitting records to Kafka as part of atomic operations because they have such a high success rate that they feel like they are. They’re not, and should be treated like any other fallible foreign state mutation.

An atomic phase is a set of local state mutations that occur in transactions between foreign state mutations. We say that they’re atomic because we can use an ACID-compliant database like Postgres to guarantee that either all of them will occur, or none will.

Atomic phases should be safely committed before initiating any foreign state mutation. If the call fails, our local state will still have a record of it happening that we can use to retry the operation.

A recovery point is the name of a checkpoint that we reach after successfully executing any atomic phase or foreign state mutation. Its purpose is to allow a request that’s being retried to jump back to the point in the lifecycle just before the last attempt failed.

For convenience, we’re going to store the name of the recovery point reached right onto the idempotency key relation that we’ll build. All requests will initially get a recovery point of started, and after any request is complete (again, through either a success or definitive error) it’ll be assigned a recovery point of finished. When in an atomic phase, the transition to a new recovery point should be committed as part of that phase’s transaction.

In-band foreign state mutations make a request slower and more difficult to reason about, so they should be avoided when possible. In many cases it’s possible to defer this type of work to after the request is complete by sending it to a background job queue.

In our Rocket Rides example the charge to Stripe probably can’t be deferred – we want to know whether it succeeded right away so that we can deny the request if it didn’t. Sending the email can and should be moved to the background.

By using a transactionally-staged job drain, we can hide jobs from workers until we’ve confirmed that they’re ready to be worked by isolating them in a transaction. This also means that the background work becomes part of an atomic phase and greatly simplifies its operational properties. Work should always be offloaded to background queues wherever possible.

Now that we’ve covered a few key concepts, we’re ready to shore up Rocket Rides so that it’s resilient against any kind of failure imaginable. Let’s put together the basic schema, break the lifecycle up into atomic phases, and assemble a simple implementation that will recover from failures.

A working version (with testing) of all of this is available in the Atomic Rocket Rides repository. It might be easier to download that code and follow along.

git clone https://github.com/brandur/rocket-rides-atomic.git

Let’s design a Postgres schema for idempotency keys in our app:

CREATE TABLE idempotency_keys (
    id              BIGSERIAL   PRIMARY KEY,
    created_at      TIMESTAMPTZ NOT NULL DEFAULT now(),
    idempotency_key TEXT        NOT NULL
        CHECK (char_length(idempotency_key) <= 100),
    last_run_at     TIMESTAMPTZ NOT NULL DEFAULT now(),
    locked_at       TIMESTAMPTZ DEFAULT now(),

    -- parameters of the incoming request
    request_method  TEXT        NOT NULL
        CHECK (char_length(request_method) <= 10),
    request_params  JSONB       NOT NULL,
    request_path    TEXT        NOT NULL
        CHECK (char_length(request_path) <= 100),

    -- for finished requests, stored status code and body
    response_code   INT         NULL,
    response_body   JSONB       NULL,

    recovery_point  TEXT        NOT NULL
        CHECK (char_length(recovery_point) <= 50),
    user_id         BIGINT      NOT NULL
);

CREATE UNIQUE INDEX idempotency_keys_user_id_idempotency_key
    ON idempotency_keys (user_id, idempotency_key);

There are a few notable fields here:

  • idempotency_key: This is the user-specified idempotency key. It’s good practice to send something with good randomness like a UUID, but not necessarily required. We constrain the field’s length so that nobody sends us anything too exotic.

    We’ve made idempotency_key unique, but across (user_id, idempotency_key) so that it’s possible to have the same idempotency key for different requests as long as it’s across different user accounts.

  • locked_at: A field that indicates whether this idempotency key is actively being worked. The first API request that creates the key will lock it automatically, but subsequent retries will also set it to make sure that they’re the only request doing the work.

  • request_params: The input parameters of the request. This is stored mostly so that we can error if the user sends two requests with the same idempotency key but with different parameters, but it can also be used by our own backend to push unfinished requests to completion (see the completer below).

  • recovery_point: A text label for the last phase completed for the idempotent request (see recovery points above). Gets an initial value of started and is set to finished when the request is considered to be complete.

Recall our target API lifecycle for Rocket Rides from above.

A typical API request to our embellished Rocket Rides backend.

Let’s bring up Postgres relations for everything else we’ll need to build this app including audit records, rides, and users. Given that we aim to maximize reliability, we’ll try to follow database best practices and use NOT NULL, unique, and foreign key constraints wherever we can.

--
-- A relation to hold records for every user of our app.
--
CREATE TABLE users (
    id                 BIGSERIAL       PRIMARY KEY,
    email              TEXT            NOT NULL UNIQUE
        CHECK (char_length(email) <= 255),

    -- Stripe customer record with an active credit card
    stripe_customer_id TEXT            NOT NULL UNIQUE
        CHECK (char_length(stripe_customer_id) <= 50)
);

--
-- Now that we have a users table, add a foreign key
-- constraint to idempotency_keys which we created above.
--
ALTER TABLE idempotency_keys
    ADD CONSTRAINT idempotency_keys_user_id_fkey
    FOREIGN KEY (user_id) REFERENCES users(id) ON DELETE RESTRICT;

--
-- A relation that holds audit records that can help us piece
-- together exactly what happened in a request if necessary
-- after the fact. It can also, for example, be used to
-- drive internal security programs tasked with looking for
-- suspicious activity.
--
CREATE TABLE audit_records (
    id                 BIGSERIAL       PRIMARY KEY,

    -- action taken, for example "created"
    action             TEXT            NOT NULL
        CHECK (char_length(action) <= 50),

    created_at         TIMESTAMPTZ     NOT NULL DEFAULT now(),
    data               JSONB           NOT NULL,
    origin_ip          CIDR            NOT NULL,

    -- resource ID and type, for example "ride" ID 123
    resource_id        BIGINT          NOT NULL,
    resource_type      TEXT            NOT NULL
        CHECK (char_length(resource_type) <= 50),

    user_id            BIGINT          NOT NULL
        REFERENCES users ON DELETE RESTRICT
);

--
-- A relation representing a single ride by a user.
-- Notably, it holds the ID of a successful charge to
-- Stripe after we have one.
--
CREATE TABLE rides (
    id                 BIGSERIAL       PRIMARY KEY,
    created_at         TIMESTAMPTZ     NOT NULL DEFAULT now(),

    -- Store a reference to the idempotency key so that we can recover an
    -- already-created ride. Note that idempotency keys are not stored
    -- permanently, so make sure to SET NULL when a referenced key is being
    -- reaped.
    idempotency_key_id BIGINT
        REFERENCES idempotency_keys ON DELETE SET NULL,

    -- origin and destination latitudes and longitudes
    origin_lat         NUMERIC(13, 10) NOT NULL,
    origin_lon         NUMERIC(13, 10) NOT NULL,
    target_lat         NUMERIC(13, 10) NOT NULL,
    target_lon         NUMERIC(13, 10) NOT NULL,

    -- ID of Stripe charge like ch_123; NULL until we have one
    stripe_charge_id   TEXT            UNIQUE
        CHECK (char_length(stripe_charge_id) <= 50),

    user_id            BIGINT          NOT NULL
        REFERENCES users ON DELETE RESTRICT,
    CONSTRAINT rides_user_id_idempotency_key_unique UNIQUE (user_id, idempotency_key_id)
);

CREATE INDEX rides_idempotency_key_id
    ON rides (idempotency_key_id)
    WHERE idempotency_key_id IS NOT NULL;

--
-- A relation that holds our transactionally-staged jobs
-- (see "Background jobs and job staging" above).
--
CREATE TABLE staged_jobs (
    id                 BIGSERIAL       PRIMARY KEY,
    job_name           TEXT            NOT NULL,
    job_args           JSONB           NOT NULL
);

Now that we’ve got a feel for what our data should look like, let’s break the API request into distinct atomic phases. These are the basic rules for identifying them:

  1. Upserting the idempotency key gets its own atomic phase.
  2. Every foreign state mutation gets its own atomic phase.
  3. After those phases have been identified, all other operations between them are grouped into atomic phases. Even if there are 100 operations against an ACID database between two foreign state mutations, they can all safely belong to the same phase.

So in our example, we have an atomic phase for inserting the idempotency key (tx1) and another for making our charge call to Stripe (tx3) and storing the result. Every other operation around tx1 and tx3 gets grouped together and becomes part of two more phases, tx2 and tx4. tx2 through tx4 can each be reached by a recovery point that’s set by the transaction that committed before it (started, ride_created, and charge_created).
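The grouping rules are mechanical enough to express in a few lines of Ruby. This is only a sketch: `Step` and `atomic_phases` are hypothetical names, and storing the charge ID is folded into the foreign step itself to mirror how tx3 both calls Stripe and stores the result.

```ruby
# kind is :key_upsert, :foreign, or :local
Step = Struct.new(:name, :kind)

# Splits an ordered list of steps into atomic phases: the key upsert and each
# foreign state mutation get a phase of their own, and every run of adjacent
# local steps shares a single phase.
def atomic_phases(steps)
  steps
    .chunk_while { |a, b| a.kind == :local && b.kind == :local }
    .map { |group| group.map(&:name) }
end

steps = [
  Step.new("upsert idempotency key",                  :key_upsert),
  Step.new("create ride record",                      :local),
  Step.new("create audit record",                     :local),
  Step.new("charge via Stripe and store charge ID",   :foreign),
  Step.new("stage receipt email job",                 :local),
  Step.new("store response on idempotency key",       :local)
]

phases = atomic_phases(steps)
phases.each_with_index do |names, i|
  puts "tx#{i + 1}: #{names.join(', ')}"
end
```

With the Rocket Rides steps as input this yields four phases, matching tx1 through tx4.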

API request to Rocket Rides broken into foreign state mutations and atomic phases.

Our implementation for an atomic phase will wrap everything in a transaction block (note we’re using Ruby, but this same concept is possible in any language) and give each phase three options for what it can return:

  1. A RecoveryPoint which sets a new recovery point. This happens within the same transaction as the rest of the phase so it’s all guaranteed to be atomic. Execution continues normally into the next phase.
  2. A Response which sets the idempotent request’s recovery point to finished and returns a response to the user. This should be used as part of the normal success condition, but can also be used to return early with a non-recoverable error. Say for example that a user’s credit card is not valid – no matter how many times the request is retried, it will never go through.
  3. A NoOp which indicates that program flow should continue, but that neither a recovery point nor response should be set.

Don’t worry about parsing the specific code too much, but here’s what it might look like:

def atomic_phase(key, &block)
  error = false
  begin
    DB.transaction(isolation: :serializable) do
      ret = block.call

      if ret.is_a?(NoOp) || ret.is_a?(RecoveryPoint) || ret.is_a?(Response)
        ret.call(key)
      else
        raise "Blocks to #atomic_phase should return one of " \
          "NoOp, RecoveryPoint, or Response"
      end
    end
  rescue Sequel::SerializationFailure
    # you could possibly retry this error instead
    error = true
    halt 409, JSON.generate(wrap_error(Messages.error_retry))
  rescue
    error = true
    halt 500, JSON.generate(wrap_error(Messages.error_internal))
  ensure
    # If we're leaving under an error condition, try to unlock the idempotency
    # key right away so that another request can try again.
    if error && !key.nil?
      begin
        key.update(locked_at: nil)
      rescue StandardError
        # We're already inside an error condition, so swallow any additional
        # errors from here and just send them to logs.
        puts "Failed to unlock key #{key.id}."
      end
    end
  end
end

# Represents an action to perform a no-op. One possible option for a return
# from an #atomic_phase block.
class NoOp
  def call(_key)
    # no-op
  end
end

# Represents an action to set a new recovery point. One possible option for a
# return from an #atomic_phase block.
class RecoveryPoint
  attr_accessor :name

  def initialize(name)
    self.name = name
  end

  def call(key)
    raise ArgumentError, "key must be provided" if key.nil?
    key.update(recovery_point: name)
  end
end

# Represents an action to set a new API response (which will be stored onto an
# idempotency key). One possible option for a return from an #atomic_phase
# block.
class Response
  attr_accessor :data
  attr_accessor :status

  def initialize(status, data)
    self.status = status
    self.data = data
  end

  def call(key)
    raise ArgumentError, "key must be provided" if key.nil?
    key.update(
      locked_at: nil,
      recovery_point: RECOVERY_POINT_FINISHED,
      response_code: status,
      response_body: data
    )
  end
end

In the case of a serialization error, we return a 409 Conflict because that almost certainly means that a concurrent request conflicted with what we were trying to do. In a real app, you probably want to just retry the operation right away because there’s a good chance it will succeed this time.
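Retrying right away could look something like the following sketch. To keep it self-contained, `SerializationFailure` here is a local stand-in class rather than Sequel’s error, and the limit of three attempts is an arbitrary assumption.

```ruby
# Stand-in for the driver's serialization error (with Sequel this would be
# Sequel::SerializationFailure, as in the code above).
class SerializationFailure < StandardError; end

# Runs the block, retrying a bounded number of times when the transaction
# loses a serialization conflict. Re-raises once attempts are exhausted so
# that the caller can still respond with a 409.
def with_serialization_retries(max_attempts: 3)
  attempts = 0
  begin
    attempts += 1
    yield attempts
  rescue SerializationFailure
    retry if attempts < max_attempts
    raise
  end
end

# Simulated transaction that conflicts on its first attempt only.
result = with_serialization_retries do |attempt|
  raise SerializationFailure, "could not serialize access" if attempt == 1

  "committed on attempt #{attempt}"
end
```

The whole transaction has to be re-run on each attempt, so the block should contain everything that was inside the failed transaction.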

For other errors we return a 500 Internal Server Error. For either type of error, we try to unlock the idempotency key before finishing so that another request has a chance to retry with it.

When a new idempotency key value comes into the API, we’re going to create or update a corresponding row that we’ll use to track its progress.

The easiest case is if we’ve never seen the key before. If so, just insert a new row with appropriate values.

If we have seen the key, lock it so that no other requests that might be operating concurrently also try the operation. If the key was already locked, return a 409 Conflict to indicate that to the user.

A key that’s already set to finished is simply allowed to fall through and have its response return on the standard success path. We’ll see that in just a moment.

key = nil

atomic_phase(key) do
  key = IdempotencyKey.first(user_id: user.id, idempotency_key: key_val)

  if key
    # Programs sending multiple requests with different parameters but the
    # same idempotency key is a bug.
    if key.request_params != params
      halt 409, JSON.generate(wrap_error(Messages.error_params_mismatch))
    end

    # Only acquire a lock if the key is unlocked or its lock has expired
    # because the original request was long enough ago.
    if key.locked_at && key.locked_at > Time.now - IDEMPOTENCY_KEY_LOCK_TIMEOUT
      halt 409, JSON.generate(wrap_error(Messages.error_request_in_progress))
    end

    # Lock the key and update latest run unless the request is already
    # finished.
    if key.recovery_point != RECOVERY_POINT_FINISHED
      key.update(last_run_at: Time.now, locked_at: Time.now)
    end
  else
    key = IdempotencyKey.create(
      idempotency_key: key_val,
      locked_at:       Time.now,
      recovery_point:  RECOVERY_POINT_STARTED,
      request_method:  request.request_method,
      request_params:  Sequel.pg_jsonb(params),
      request_path:    request.path_info,
      user_id:         user.id,
    )
  end

  # no response and no need to set a recovery point
  NoOp.new
end

At first glance this code might not look like it’s safe from having two concurrent requests come in close succession and try to lock the same key, but it is because the atomic phase is wrapped in a SERIALIZABLE transaction. If two different transactions both try to lock any one key, one of them will be aborted by Postgres.

We’re going to implement the rest of the API request as a simple state machine whose states form a directed acyclic graph (DAG). Unlike a general directed graph, a DAG never cycles back on itself, so a request only ever moves forward.

Each atomic phase will be activated from a recovery point, which was either read from a recovered idempotency key, or set by the previous atomic phase. We continue to move through phases until reaching a finished state, upon which the loop is broken and a response is sent back to the user.

An idempotency key that was already finished will enter the loop, break immediately, and send back whatever response was stored onto it.

loop do
  case key.recovery_point
  when RECOVERY_POINT_STARTED
    atomic_phase(key) do
      ...
    end

  when RECOVERY_POINT_RIDE_CREATED
    atomic_phase(key) do
      ...
    end

  when RECOVERY_POINT_CHARGE_CREATED
    atomic_phase(key) do
      ....
    end

  when RECOVERY_POINT_FINISHED
    break

  else
    raise "Bug! Unhandled recovery point '#{key.recovery_point}'."
  end

  # If we got here, allow the loop to move us onto the next phase of the
  # request. Finished requests will break the loop.
end

[key.response_code, JSON.generate(key.response_body)]
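To see how the loop behaves across a failure and a retry, here’s a self-contained simulation. It’s a sketch only: a plain hash stands in for the idempotency key row, the phase bodies are empty, and `run_request` and its `fail_at` parameter are hypothetical names for illustration.

```ruby
# Maps each recovery point to the one its phase sets on success.
NEXT_RECOVERY_POINT = {
  "started"        => "ride_created",
  "ride_created"   => "charge_created",
  "charge_created" => "finished"
}.freeze

# Advances the key one phase at a time until it's finished. `fail_at`
# simulates a crash in the phase that starts from that recovery point.
def run_request(key, log, fail_at: nil)
  until key[:recovery_point] == "finished"
    current = key[:recovery_point]
    raise "simulated failure at #{current}" if current == fail_at

    log << current
    # In the real implementation this update commits atomically with the
    # phase's other work. An unknown recovery point raises KeyError here.
    key[:recovery_point] = NEXT_RECOVERY_POINT.fetch(current)
  end
  key
end

key = { recovery_point: "started" }
log = []

begin
  run_request(key, log, fail_at: "ride_created") # first attempt dies midway
rescue RuntimeError
  # the client sees an error and retries with the same idempotency key
end

run_request(key, log) # resumes from ride_created instead of starting over
```

Each phase appears in `log` exactly once even though the request needed two attempts, which is the property we’re after.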

The second phase (tx2 in the diagram above) is simple: create a record for the ride in our local database, insert an audit record, and set a new recovery point to ride_created.

atomic_phase(key) do
  ride = Ride.create(
    idempotency_key_id: key.id,
    origin_lat:         params["origin_lat"],
    origin_lon:         params["origin_lon"],
    target_lat:         params["target_lat"],
    target_lon:         params["target_lon"],
    stripe_charge_id:   nil, # no charge created yet
    user_id:            user.id,
  )

  # in the same transaction insert an audit record for what happened
  AuditRecord.insert(
    action:        AUDIT_RIDE_CREATED,
    data:          Sequel.pg_jsonb(params),
    origin_ip:     request.ip,
    resource_id:   ride.id,
    resource_type: "ride",
    user_id:       user.id,
  )

  RecoveryPoint.new(RECOVERY_POINT_RIDE_CREATED)
end

With basic records in place, it’s time to try our foreign state mutation by trying to charge the customer via Stripe. Here we initiate a charge for $20 using a Stripe customer ID that was already stored on their user record. On success, update the ride created in the last step with the new Stripe charge ID and set recovery point charge_created.

atomic_phase(key) do
  # retrieve a ride record if necessary (i.e. we're recovering)
  ride = Ride.first(idempotency_key_id: key.id) if ride.nil?

  # if ride is still nil by this point, we have a bug
  raise "Bug! Should have ride for key at #{RECOVERY_POINT_RIDE_CREATED}." \
    if ride.nil?

  raise "Simulated fail with `raise_error` param." if raise_error

  # Rocket Rides is still a new service, so during our prototype phase
  # we're going to give $20 fixed-cost rides to everyone, regardless of
  # distance. We'll implement a better algorithm later to better
  # represent the cost in time and jetfuel on the part of our pilots.
  begin
    charge = Stripe::Charge.create({
      amount:      20_00,
      currency:    "usd",
      customer:    user.stripe_customer_id,
      description: "Charge for ride #{ride.id}",
    }, {
      # Pass through our own unique ID rather than the value
      # transmitted to us so that we can guarantee uniqueness to Stripe
      # across all Rocket Rides accounts.
      idempotency_key: "rocket-rides-atomic-#{key.id}"
    })
  rescue Stripe::CardError => e
    # Sets the response on the key and short circuits execution by
    # sending execution right to 'finished'.
    Response.new(402, wrap_error(Messages.error_payment(error: e.message)))
  rescue Stripe::StripeError
    Response.new(503, wrap_error(Messages.error_payment_generic))
  else
    ride.update(stripe_charge_id: charge.id)
    RecoveryPoint.new(RECOVERY_POINT_CHARGE_CREATED)
  end
end

The call to Stripe produces a few possibilities for unrecoverable errors (i.e. errors that will never go away no matter how many times the call is retried). If we run into one, set the request to finished and return an appropriate response. This might occur if the credit card was invalid or the transaction was otherwise declined by the payment gateway.

Now that our charge has been persisted, the next step is to send a receipt to the user. Making an external mail call would normally require its own foreign state mutation, but because we’re using a transactionally-staged job drain, we get a guarantee that the operation commits along with the rest of the transaction.

atomic_phase(key) do
  StagedJob.insert(
    job_name: "send_ride_receipt",
    job_args: Sequel.pg_jsonb({
      amount:   20_00,
      currency: "usd",
      user_id:  user.id
    })
  )
  Response.new(201, wrap_ok(Messages.ok))
end

The final step is to set a response telling the user that everything worked as expected. We’re done!

Besides the web process running the API, a few others are needed to make everything work (see Atomic Rocket Rides’ Procfile for the full list and the corresponding implementations in the same repository).

There should be an enqueuer that moves jobs from staged_jobs to the job queue after their inserting transaction has committed. See this article for details on how to build one, or the implementation from Atomic Rocket Rides.
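As a rough illustration of what one enqueuer pass does, here’s an in-memory sketch. It is not the implementation from the repository: plain arrays stand in for committed rows of staged_jobs and for the job queue, and `drain_staged_jobs` is a hypothetical name.

```ruby
# Moves up to `batch_size` jobs from `staged` to `queue` and returns how many
# were moved. `staged` stands in for rows of staged_jobs whose inserting
# transaction has already committed.
def drain_staged_jobs(staged, queue, batch_size: 100)
  batch = staged.first(batch_size)

  # Push to the queue first and remove afterwards. In this sketch a crash
  # between the two steps would enqueue a job twice on the next pass, so
  # workers need to tolerate duplicates.
  batch.each { |job| queue << job }
  staged.shift(batch.length)

  batch.length
end

staged = [
  { id: 1, job_name: "send_ride_receipt", job_args: { user_id: 10 } },
  { id: 2, job_name: "send_ride_receipt", job_args: { user_id: 11 } },
  { id: 3, job_name: "send_ride_receipt", job_args: { user_id: 12 } }
]
queue = []

moved = drain_staged_jobs(staged, queue, batch_size: 2)
```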

One problem with this implementation is we’re reliant on clients to push indeterminate requests (for example, one that might have appeared to be a timeout) to completion. Usually clients are willing to do this because they want to see their requests go through, but there can be cases where a client starts working, never quite finishes, and drops forever.

A stretch goal is to implement a completer. Its only job is to find requests that never finished satisfactorily and that clients appear to have abandoned, and to push them through to completion.

It doesn’t even have to have special knowledge about how the stack is implemented. It just needs to know how to read idempotency keys and have a specialized internal authentication path that allows it to retry anyone’s request.
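The selection step might look like this sketch. The five-minute idle window and the `abandoned_keys` helper are assumptions made for illustration, and hashes stand in for rows of idempotency_keys.

```ruby
# Returns keys that never reached "finished" and whose last attempt was more
# than `idle_for` seconds before `now`.
def abandoned_keys(keys, now:, idle_for: 300)
  keys.select do |key|
    key[:recovery_point] != "finished" && key[:last_run_at] < now - idle_for
  end
end

now = Time.at(1_700_000_000)
keys = [
  { id: 1, recovery_point: "finished",       last_run_at: now - 3600 },
  { id: 2, recovery_point: "charge_created", last_run_at: now - 3600 },
  { id: 3, recovery_point: "started",        last_run_at: now - 10 }
]

stale = abandoned_keys(keys, now: now)
```

Only key 2 is selected: key 1 is already finished and key 3 was attempted too recently to be considered abandoned. The completer would then replay each selected request using its stored method, path, and parameters.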

See the Atomic Rocket Rides repository for a completer implementation.

Idempotency keys are meant to act as a mechanism for guaranteeing idempotence, and not as a permanent archive of historical requests. After some amount of time a reaper process should go through keys and delete them.

I’d suggest a threshold of about 72 hours so that even if a bug is deployed on Friday that errors a large number of valid requests, an app could still keep a record of them throughout the weekend and into Monday, when a developer would have a chance to commit a fix and have the completer push them through to success.

An ideal reaper might even notice requests that could not be finished successfully and try to do some cleanup on them. If cleanup is difficult or impossible, it should put them in a list somewhere so that a human can find out what failed.
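A reaper pass along those lines could be sketched as follows. Measuring age from created_at and the `reap` helper itself are assumptions for illustration. Hashes again stand in for database rows.

```ruby
REAP_AFTER = 72 * 60 * 60 # 72 hours in seconds, the threshold suggested above

# Looks at keys older than the horizon and splits them into two lists: keys
# that finished and can simply be deleted, and keys that never finished and
# should be cleaned up or surfaced to a human.
def reap(keys, now:)
  expired = keys.select { |key| key[:created_at] < now - REAP_AFTER }
  expired.partition { |key| key[:recovery_point] == "finished" }
end

now = Time.at(1_700_000_000)
keys = [
  { id: 1, recovery_point: "finished",     created_at: now - 80 * 3600 },
  { id: 2, recovery_point: "ride_created", created_at: now - 80 * 3600 },
  { id: 3, recovery_point: "finished",     created_at: now - 3600 }
]

to_delete, to_review = reap(keys, now: now)
```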

See the Atomic Rocket Rides repository for a reaper implementation.

Now that we have all the pieces in place, let’s assume the truth of Murphy’s Law and imagine some scenarios that could go wrong while a client app is talking to the new Atomic Rocket Rides backend:

  • The client makes a request, but the connection breaks before it reaches the backend: The client, having used an idempotency key, knows that retries are safe and so retries. The next attempt succeeds.

  • Two requests try to create an idempotency key at the same time: A UNIQUE constraint in the database guarantees that only one request can succeed. One goes through, and the other gets a 409 Conflict.

  • An idempotency key is created, but the database goes down and it fails soon after: The client continues to retry against the API until it comes back online. Once it does, the created key is recovered and the request is continued.

  • Stripe is down: The atomic phase containing the Stripe request fails, and the API responds with an error that tells the client to retry. They continue to do so until Stripe comes back online and the charge succeeds.

  • A server process dies while waiting for a response from Stripe: Luckily, the call to Stripe was also made with its own idempotency key. The client retries and a new call to Stripe is invoked with the same key. Stripe’s own idempotency guarantees ensure that we haven’t double-charged our user.

  • A bad deploy 500s all requests midway through: Developers scramble and deploy a fix for the bug. After it’s out, clients retry and the original requests succeed along the newly bug-free path. If the fix took so long to get out that clients have long since gone away, then the completer process pushes them through.

Our care around implementing a failsafe design has paid off – the system is safe despite a wide variety of possible failures.

If we know that a foreign state mutation is an idempotent operation or it supports an idempotency key (like Stripe does), we know that it’s safe to retry any failures that we see.

Unfortunately, not every service will make this guarantee. If we try to make a non-idempotent foreign state mutation and we see a failure, we may have to persist this operation as permanently errored. In many cases we won’t know whether it’s safe to retry or not, and we’ll have to take the conservative route and fail the operation.

The exception is if we got an error back from the non-idempotent API, but one that tells us explicitly that it’s okay to retry. Indeterminate errors like a connection reset or timeout will have to be marked as failed.
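Put as a small decision function, the rule looks like this. The function and its parameter names are hypothetical, and how a service signals that a retry is safe will differ from one API to the next.

```ruby
# Decides what to do after a failed foreign state mutation.
#   idempotent:      the call is naturally idempotent or carried an
#                    idempotency key that the remote service honors
#   retryable_error: the remote API explicitly told us a retry is safe
def after_foreign_failure(idempotent:, retryable_error:)
  return :retry if idempotent
  return :retry if retryable_error

  # Indeterminate failure (timeout, connection reset) against a
  # non-idempotent API: take the conservative route.
  :mark_failed
end

stripe_timeout = after_foreign_failure(idempotent: true,  retryable_error: false)
legacy_timeout = after_foreign_failure(idempotent: false, retryable_error: false)
legacy_busy    = after_foreign_failure(idempotent: false, retryable_error: true)
```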

This is why you should implement idempotency and/or idempotency keys on all your services!

It’s worth mentioning that none of this is possible on a data store without ACID transactions (MongoDB, for example, before it added multi-document transactions in version 4.0). Without transactional semantics a database can’t ever guarantee that any two operations commit atomically – every operation against your database becomes equivalent to a foreign state mutation because the notion of an atomic phase is impossible.

This article focuses heavily on APIs, but note that this same technique is reusable for other software as well. A common problem in web apps is double form submission. A user clicking the “Submit” button twice in quick succession may initiate two separate HTTP calls, and in cases where submissions have non-idempotent side effects (e.g. charging the user) this is a problem.

When rendering the form initially, we can add an <input type="hidden"> to it that contains an idempotency key. This value will stay the same across multiple submissions, and the server can use it to dedup the request.
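Here’s a minimal sketch of that idea using Ruby’s standard ERB library. The `/charges` path, the `idempotency_key` field name, and the in-memory `Set` used for deduplication are all assumptions for illustration. A real app would store keys durably, as described earlier.

```ruby
require "erb"
require "securerandom"
require "set"

# Renders a form with an idempotency key embedded as a hidden field. The key
# is generated at render time, so repeated clicks on the same page send the
# same value.
def render_charge_form(key = SecureRandom.uuid)
  ERB.new(<<~HTML).result_with_hash(key: key)
    <form method="post" action="/charges">
      <input type="hidden" name="idempotency_key" value="<%= ERB::Util.html_escape(key) %>">
      <button type="submit">Submit</button>
    </form>
  HTML
end

key  = "0ccb7813-e63d-4377-93c5-476cb93038f3"
html = render_charge_form(key)

# Server side: Set#add? returns nil when the value was already present, so
# only the first submission of a given key is treated as new.
seen = Set.new
first_click  = !seen.add?(key).nil?
second_click = !seen.add?(key).nil?
```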

API backends should aim to be passively safe – no matter what kind of failures are thrown at them they’ll end up in a stable state, and users are never left broken even in the most extreme cases. From there, active mechanisms can drive the system towards perfect cohesion. Ideally, human operators never have to intervene to fix things (or at least as infrequently as possible).

Purely idempotent transactions and the idempotency keys with atomic phases described here are two ways to move in that direction. Failures are not only understood to be possible, but are expected, and enough thought has been applied to the system’s design that we know it’ll tolerate failure cleanly no matter what happens.

[1] There is one caveat that it may be possible to implement two-phase commit between a system and all other systems where it performs foreign state mutations. This would allow distributed rollbacks, but is complex and time-consuming enough to implement that it’s rarely seen with any kind of ubiquity in real software environments.
