In APIs, idempotency is a powerful concept. An idempotent endpoint is one that can be called any number of times while guaranteeing that its side effects occur only once. In a messy world where clients and servers may occasionally crash or have their connections drop partway through a request, idempotency is a huge help in making systems robust to failure. Clients that are uncertain whether a request succeeded or failed can simply keep retrying it until they get a definitive response.
As we’re about to see in this article, implementing a server so that all requests to it are perfectly idempotent isn’t always easy. For endpoints that get away with only mutating local state in an ACID database, it’s possible to get a robust and simple idempotency implementation by mapping requests to transactions, which I wrote about in more detail a few weeks ago. That approach is far simpler than what’s described here, and I’d suggest that anyone who can get away with it take that path.
Implementations that need to make synchronous changes in foreign state (i.e. outside of a local ACID store) are somewhat more difficult to design. A basic example of this is if an app needs to make a Stripe request to create a charge and needs to know in-band whether it went through so that it can decide whether to proffer some good or service. To guarantee idempotency on this type of endpoint we’ll need to introduce idempotency keys.
An idempotency key is a unique value that’s generated by a client and sent to an API along with a request. The server stores the key to use for bookkeeping the status of that request on its end. If a request should fail partway through, the client retries with the same idempotency key value, and the server uses it to look up the request’s state and continue from where it left off. The name “idempotency key” comes from Stripe’s API.
A common way to transmit an idempotency key is through an HTTP header:
POST /v1/charges
...
Idempotency-Key: 0ccb7813-e63d-4377-93c5-476cb93038f3
...
amount=1000&currency=usd
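On the client side, the essential rule is to generate the key once per logical operation and reuse it across retries. Here's a minimal Ruby sketch of that pattern; the transport is abstracted as a block that receives the key and returns an HTTP status code, and the function name and retry policy are illustrative rather than from any real client library:

```ruby
require "securerandom"

# Generate an idempotency key once per logical operation and reuse it on
# every retry. The transport is supplied as a block that receives the key
# and returns an HTTP status code; 409s and 5xx responses are retried.
def with_idempotency_key(max_attempts: 5)
  key = SecureRandom.uuid # generated once; identical across retries
  attempts = 0
  begin
    attempts += 1
    status = yield(key)
    raise "retryable status #{status}" if status == 409 || status >= 500
    status
  rescue RuntimeError
    retry if attempts < max_attempts
    raise
  end
end
```

The important property is that the key is created outside the retry loop: every attempt the server sees carries the same value, so it can recognize them as the same logical request.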
Once the server knows that a request has definitively finished by either succeeding or failing in a way that’s not recoverable, it stores the request’s results and associates them with the idempotency key. If a client makes another request with the same key, the server simply short circuits and returns the stored results.
Keys are not meant to be used as a permanent request archive but rather as a mechanism for ensuring near-term correctness. Servers should recycle them out of the system beyond a horizon where they won’t be of much use – say 24 hours or so.
Let’s look at how to design idempotency keys for an API by building a reference implementation.
Our great dev relations team at Stripe has built an app called Rocket Rides to demonstrate the use of the Connect platform and other interesting parts of the API. In Rocket Rides, users who are in a hurry share a ride with a jetpack-certified pilot to get where they’re going fast. SOMA’s gridlock traffic disappears into the distance as they soar free through virgin skies. Travel can be a little more risky than Lyft, so make sure to pack an extra parachute.
The Rocket Rides repository comes with a simple server implementation, but software tends to grow with time, so to be more representative of what a real service with 15 engineers and half a dozen product owners would look like, we’re going to complicate things with a few embellishments.
When a new ride comes in we’ll perform this set of operations:
Our backend implementation will be called from the Rocket Rides mobile app with an idempotency key. If a request fails, the app will continue retrying the operation with the same key, and our job as backend implementers is to make sure that’s safe. We’ll be charging users’ credit cards as part of the request, and we absolutely can’t take the risk of charging them twice.
Most of the time we can expect every one of our Rocket Rides API calls to go swimmingly, with every operation succeeding without a problem. However, once we reach the scale of thousands of API calls a day, we’ll start to notice a few problems appearing here and there: requests failing due to poor cellular connectivity, API calls to Stripe failing occasionally, or bad turbulence caused by moving at supersonic speeds periodically knocking users offline. Once we reach the scale of millions of API calls a day, basic probability dictates that we’ll be seeing these sorts of things happening all the time.
Let’s look at a few examples of things that can go wrong:
Now that we have a premise in place, let’s introduce some ideas that will let us elegantly solve this problem.
To shore up our backend, it’s key to identify where we’re making foreign state mutations; that is, calling out and manipulating data on another system. This might be creating a charge on Stripe, adding a DNS record, or sending an email.
Some foreign state mutations are idempotent by nature (e.g. adding a DNS record), some are not idempotent but can be made so with the help of an idempotency key (e.g. creating a charge on Stripe, sending an email), and some can’t be made idempotent at all, most often because a foreign service hasn’t designed them that way and doesn’t provide a mechanism like an idempotency key.
The reason that the local vs. foreign distinction matters is that unlike a local set of operations where we can leverage an ACID store to roll back a result that we didn’t like, once we make our first foreign state mutation, we’re committed one way or another 1. We’ve pushed data into a system beyond our own boundaries and we shouldn’t lose track of it.
We’re using an API call to Stripe as a common example, but remember that even foreign calls within your own infrastructure count! It’s tempting to treat emitting records to Kafka as part of atomic operations because they have such a high success rate that they feel like they are. They’re not, and should be treated like any other fallible foreign state mutation.
An atomic phase is a set of local state mutations that occur in transactions between foreign state mutations. We say that they’re atomic because we can use an ACID-compliant database like Postgres to guarantee that either all of them will occur, or none will.
Atomic phases should be safely committed before initiating any foreign state mutation. If the call fails, our local state will still have a record of it happening that we can use to retry the operation.
A recovery point is the name of a checkpoint that we reach after successfully executing any atomic phase or foreign state mutation. Its purpose is to allow a request that’s being retried to jump back to the point in the lifecycle just before the last attempt failed.
For convenience, we’re going to store the name of the recovery point reached right onto the idempotency key relation that we’ll build. All requests will initially get a recovery point of started, and after any request is complete (again, through either a success or definitive error) it’ll be assigned a recovery point of finished.
When in an atomic phase, the transition to a new recovery point should be committed as part of that phase’s transaction.
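Expressed in Ruby, the language of the implementation below, the recovery points for our example might be collected as constants like this. This is just a sketch of the naming scheme; the two intermediate names are specific to the Rocket Rides lifecycle we'll build shortly:

```ruby
# Recovery point names stored on the idempotency key relation. Every request
# starts at "started" and ends at "finished" once a response is recorded.
RECOVERY_POINT_STARTED        = "started"
RECOVERY_POINT_RIDE_CREATED   = "ride_created"
RECOVERY_POINT_CHARGE_CREATED = "charge_created"
RECOVERY_POINT_FINISHED       = "finished"

# The order of progression through the request's lifecycle; a retried
# request resumes at whichever point was last committed.
RECOVERY_POINTS = [
  RECOVERY_POINT_STARTED,
  RECOVERY_POINT_RIDE_CREATED,
  RECOVERY_POINT_CHARGE_CREATED,
  RECOVERY_POINT_FINISHED,
].freeze
```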
In-band foreign state mutations make a request slower and more difficult to reason about, so they should be avoided when possible. In many cases it’s possible to defer this type of work to after the request is complete by sending it to a background job queue.
In our Rocket Rides example the charge to Stripe probably can’t be deferred – we want to know whether it succeeded right away so that we can deny the request if it didn’t. Sending an email, however, can and should be deferred to the background.
By using a transactionally-staged job drain, we can hide jobs from workers until we’ve confirmed that they’re ready to be worked by isolating them in a transaction. This also means that the background work becomes part of an atomic phase and greatly simplifies its operational properties. Work should be offloaded to background queues wherever possible.
Now that we’ve covered a few key concepts, we’re ready to shore up Rocket Rides so that it’s resilient against any kind of failure imaginable. Let’s put together the basic schema, break the lifecycle up into atomic phases, and assemble a simple implementation that will recover from failures.
A working version (with testing) of all of this is available in the Atomic Rocket Rides repository. It might be easier to download that code and follow along.
git clone https://github.com/brandur/rocket-rides-atomic.git
Let’s design a Postgres schema for idempotency keys in our app:
CREATE TABLE idempotency_keys (
id BIGSERIAL PRIMARY KEY,
created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
idempotency_key TEXT NOT NULL
CHECK (char_length(idempotency_key) <= 100),
last_run_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
locked_at TIMESTAMPTZ DEFAULT now(),
-- parameters of the incoming request
request_method TEXT NOT NULL
CHECK (char_length(request_method) <= 10),
request_params JSONB NOT NULL,
request_path TEXT NOT NULL
CHECK (char_length(request_path) <= 100),
-- for finished requests, stored status code and body
response_code INT NULL,
response_body JSONB NULL,
recovery_point TEXT NOT NULL
CHECK (char_length(recovery_point) <= 50),
user_id BIGINT NOT NULL
);
CREATE UNIQUE INDEX idempotency_keys_user_id_idempotency_key
ON idempotency_keys (user_id, idempotency_key);
There are a few notable fields here:
- idempotency_key: This is the user-specified idempotency key. It’s good practice to send something with good randomness like a UUID, but it’s not strictly required. We constrain the field’s length so that nobody sends us anything too exotic. We’ve made idempotency_key unique, but across (user_id, idempotency_key), so that different user accounts can use the same idempotency key for different requests.
- locked_at: A field that indicates whether this idempotency key is actively being worked. The first API request that creates the key will lock it automatically, but subsequent retries will also set it to make sure that they’re the only request doing the work.
- request_params: The input parameters of the request. These are stored mostly so that we can error if the user sends two requests with the same idempotency key but with different parameters, but they can also be used by our own backend to push unfinished requests to completion (see the completer below).
- recovery_point: A text label for the last phase completed for the idempotent request (see recovery points above). Gets an initial value of started and is set to finished when the request is considered to be complete.
Recall our target API lifecycle for Rocket Rides from above. Let’s bring up Postgres relations for everything else we’ll need to build this app, including audit records, rides, and users. Given that we aim to maximize reliability, we’ll try to follow database best practices and use NOT NULL, unique, and foreign key constraints wherever we can.
--
-- A relation to hold records for every user of our app.
--
CREATE TABLE users (
id BIGSERIAL PRIMARY KEY,
email TEXT NOT NULL UNIQUE
CHECK (char_length(email) <= 255),
-- Stripe customer record with an active credit card
stripe_customer_id TEXT NOT NULL UNIQUE
CHECK (char_length(stripe_customer_id) <= 50)
);
--
-- Now that we have a users table, add a foreign key
-- constraint to idempotency_keys which we created above.
--
ALTER TABLE idempotency_keys
ADD CONSTRAINT idempotency_keys_user_id_fkey
FOREIGN KEY (user_id) REFERENCES users(id) ON DELETE RESTRICT;
--
-- A relation that holds audit records that can help us piece
-- together exactly what happened in a request if necessary
-- after the fact. It can also, for example, be used to
-- drive internal security programs tasked with looking for
-- suspicious activity.
--
CREATE TABLE audit_records (
id BIGSERIAL PRIMARY KEY,
-- action taken, for example "created"
action TEXT NOT NULL
CHECK (char_length(action) <= 50),
created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
data JSONB NOT NULL,
origin_ip CIDR NOT NULL,
-- resource ID and type, for example "ride" ID 123
resource_id BIGINT NOT NULL,
resource_type TEXT NOT NULL
CHECK (char_length(resource_type) <= 50),
user_id BIGINT NOT NULL
REFERENCES users ON DELETE RESTRICT
);
--
-- A relation representing a single ride by a user.
-- Notably, it holds the ID of a successful charge to
-- Stripe after we have one.
--
CREATE TABLE rides (
id BIGSERIAL PRIMARY KEY,
created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
-- Store a reference to the idempotency key so that we can recover an
-- already-created ride. Note that idempotency keys are not stored
-- permanently, so make sure to SET NULL when a referenced key is being
-- reaped.
idempotency_key_id BIGINT
REFERENCES idempotency_keys ON DELETE SET NULL,
-- origin and destination latitudes and longitudes
origin_lat NUMERIC(13, 10) NOT NULL,
origin_lon NUMERIC(13, 10) NOT NULL,
target_lat NUMERIC(13, 10) NOT NULL,
target_lon NUMERIC(13, 10) NOT NULL,
-- ID of Stripe charge like ch_123; NULL until we have one
stripe_charge_id TEXT UNIQUE
CHECK (char_length(stripe_charge_id) <= 50),
user_id BIGINT NOT NULL
REFERENCES users ON DELETE RESTRICT,
CONSTRAINT rides_user_id_idempotency_key_unique UNIQUE (user_id, idempotency_key_id)
);
CREATE INDEX rides_idempotency_key_id
ON rides (idempotency_key_id)
WHERE idempotency_key_id IS NOT NULL;
--
-- A relation that holds our transactionally-staged jobs
-- (see "Background jobs and job staging" above).
--
CREATE TABLE staged_jobs (
id BIGSERIAL PRIMARY KEY,
job_name TEXT NOT NULL,
job_args JSONB NOT NULL
);
Now that we’ve got a feel for what our data should look like, let’s break the API request into distinct atomic phases. These are the basic rules for identifying them:
So in our example, we have an atomic phase for inserting the idempotency key (tx1) and another for making our charge call to Stripe (tx3) and storing the result. Every other operation around tx1 and tx3 gets grouped together and becomes part of two more phases, tx2 and tx4. tx2 through tx4 can each be reached by a recovery point that’s set by the transaction committed before it (started, ride_created, and charge_created respectively).
Our implementation for an atomic phase will wrap everything in a transaction block (note we’re using Ruby, but this same concept is possible in any language) and give each phase three options for what it can return:
- RecoveryPoint, which sets a new recovery point. This happens within the same transaction as the rest of the phase, so it’s all guaranteed to be atomic. Execution continues normally into the next phase.
- Response, which sets the idempotent request’s recovery point to finished and returns a response to the user. This should be used as part of the normal success condition, but can also be used to return early with a non-recoverable error. Say, for example, that a user’s credit card is not valid – no matter how many times the request is retried, it will never go through.
- NoOp, which indicates that program flow should continue, but that neither a recovery point nor a response should be set.

Don’t worry about parsing the specific code too much, but here’s what it might look like:
def atomic_phase(key, &block)
error = false
begin
DB.transaction(isolation: :serializable) do
ret = block.call
if ret.is_a?(NoOp) || ret.is_a?(RecoveryPoint) || ret.is_a?(Response)
ret.call(key)
else
raise "Blocks to #atomic_phase should return one of " \
"NoOp, RecoveryPoint, or Response"
end
end
rescue Sequel::SerializationFailure
# you could possibly retry this error instead
error = true
halt 409, JSON.generate(wrap_error(Messages.error_retry))
rescue
error = true
halt 500, JSON.generate(wrap_error(Messages.error_internal))
ensure
# If we're leaving under an error condition, try to unlock the idempotency
# key right away so that another request can try again.
if error && !key.nil?
begin
key.update(locked_at: nil)
rescue StandardError
# We're already inside an error condition, so swallow any additional
# errors from here and just send them to logs.
puts "Failed to unlock key #{key.id}."
end
end
end
end
# Represents an action to perform a no-op. One possible option for a return
# from an #atomic_phase block.
class NoOp
def call(_key)
# no-op
end
end
# Represents an action to set a new recovery point. One possible option for a
# return from an #atomic_phase block.
class RecoveryPoint
attr_accessor :name
def initialize(name)
self.name = name
end
def call(key)
raise ArgumentError, "key must be provided" if key.nil?
key.update(recovery_point: name)
end
end
# Represents an action to set a new API response (which will be stored onto an
# idempotency key). One possible option for a return from an #atomic_phase
# block.
class Response
attr_accessor :data
attr_accessor :status
def initialize(status, data)
self.status = status
self.data = data
end
def call(key)
raise ArgumentError, "key must be provided" if key.nil?
key.update(
locked_at: nil,
recovery_point: RECOVERY_POINT_FINISHED,
response_code: status,
response_body: data
)
end
end
In the case of a serialization error, we return a 409 Conflict because that almost certainly means that a concurrent request conflicted with what we were trying to do. In a real app, you’d probably want to just retry the operation right away because there’s a good chance it will succeed this time.

For other errors we return a 500 Internal Server Error. For either type of error, we try to unlock the idempotency key before finishing so that another request has a chance to retry with it.
When a new idempotency key value comes into the API, we’re going to create or update a corresponding row that we’ll use to track its progress.
The easiest case is if we’ve never seen the key before. If so, just insert a new row with appropriate values.
If we have seen the key, lock it so that no other requests that might be operating concurrently also try the operation. If the key was already locked, return a 409 Conflict to indicate that to the user.

A key that’s already set to finished is simply allowed to fall through and have its response returned on the standard success path. We’ll see that in just a moment.
key = nil
atomic_phase(key) do
key = IdempotencyKey.first(user_id: user.id, idempotency_key: key_val)
if key
# Programs sending multiple requests with different parameters but the
# same idempotency key is a bug.
if key.request_params != params
halt 409, JSON.generate(wrap_error(Messages.error_params_mismatch))
end
# Only acquire a lock if the key is unlocked or its lock has expired
# because the original request was long enough ago.
if key.locked_at && key.locked_at > Time.now - IDEMPOTENCY_KEY_LOCK_TIMEOUT
halt 409, JSON.generate(wrap_error(Messages.error_request_in_progress))
end
# Lock the key and update latest run unless the request is already
# finished.
if key.recovery_point != RECOVERY_POINT_FINISHED
key.update(last_run_at: Time.now, locked_at: Time.now)
end
else
key = IdempotencyKey.create(
idempotency_key: key_val,
locked_at: Time.now,
recovery_point: RECOVERY_POINT_STARTED,
request_method: request.request_method,
request_params: Sequel.pg_jsonb(params),
request_path: request.path_info,
user_id: user.id,
)
end
# no response and no need to set a recovery point
NoOp.new
end
At first glance this code might not look safe from having two concurrent requests come in in close succession and try to lock the same key, but it is, because the atomic phase is wrapped in a SERIALIZABLE transaction. If two different transactions both try to lock any one key, one of them will be aborted by Postgres.
We’re going to implement the rest of the API request as a simple state machine whose states form a directed acyclic graph (DAG). Unlike a normal graph, a DAG moves only in one direction and never cycles back on itself.
Each atomic phase will be activated from a recovery point, which was either read from a recovered idempotency key or set by the previous atomic phase. We continue to move through phases until reaching a finished state, upon which the loop is broken and a response is sent back to the user.
An idempotency key that was already finished will enter the loop, break immediately, and send back whatever response was stored onto it.
loop do
case key.recovery_point
when RECOVERY_POINT_STARTED
atomic_phase(key) do
...
end
when RECOVERY_POINT_RIDE_CREATED
atomic_phase(key) do
...
end
when RECOVERY_POINT_CHARGE_CREATED
atomic_phase(key) do
...
end
when RECOVERY_POINT_FINISHED
break
else
raise "Bug! Unhandled recovery point '#{key.recovery_point}'."
end
# If we got here, allow the loop to move us onto the next phase of the
# request. Finished requests will break the loop.
end
[key.response_code, JSON.generate(key.response_body)]
The second phase (tx2 in the diagram above) is simple: create a record for the ride in our local database, insert an audit record, and set a new recovery point of ride_created.
atomic_phase(key) do
ride = Ride.create(
idempotency_key_id: key.id,
origin_lat: params["origin_lat"],
origin_lon: params["origin_lon"],
target_lat: params["target_lat"],
target_lon: params["target_lon"],
stripe_charge_id: nil, # no charge created yet
user_id: user.id,
)
# in the same transaction insert an audit record for what happened
AuditRecord.insert(
action: AUDIT_RIDE_CREATED,
data: Sequel.pg_jsonb(params),
origin_ip: request.ip,
resource_id: ride.id,
resource_type: "ride",
user_id: user.id,
)
RecoveryPoint.new(RECOVERY_POINT_RIDE_CREATED)
end
With basic records in place, it’s time for our foreign state mutation: charging the customer via Stripe. Here we initiate a charge for $20 using a Stripe customer ID that was already stored on their user record. On success, we update the ride created in the last step with the new Stripe charge ID and set the recovery point charge_created.
atomic_phase(key) do
# retrieve a ride record if necessary (i.e. we're recovering)
ride = Ride.first(idempotency_key_id: key.id) if ride.nil?
# if ride is still nil by this point, we have a bug
raise "Bug! Should have ride for key at #{RECOVERY_POINT_RIDE_CREATED}." \
if ride.nil?
raise "Simulated fail with `raise_error` param." if raise_error
# Rocket Rides is still a new service, so during our prototype phase
# we're going to give $20 fixed-cost rides to everyone, regardless of
# distance. We'll implement a better algorithm later to better
# represent the cost in time and jetfuel on the part of our pilots.
begin
charge = Stripe::Charge.create({
amount: 20_00,
currency: "usd",
customer: user.stripe_customer_id,
description: "Charge for ride #{ride.id}",
}, {
# Pass through our own unique ID rather than the value
# transmitted to us so that we can guarantee uniqueness to Stripe
# across all Rocket Rides accounts.
idempotency_key: "rocket-rides-atomic-#{key.id}"
})
rescue Stripe::CardError
# Sets the response on the key and short circuits execution by
# sending execution right to 'finished'.
Response.new(402, wrap_error(Messages.error_payment(error: $!.message)))
rescue Stripe::StripeError
Response.new(503, wrap_error(Messages.error_payment_generic))
else
ride.update(stripe_charge_id: charge.id)
RecoveryPoint.new(RECOVERY_POINT_CHARGE_CREATED)
end
end
The call to Stripe produces a few possibilities for unrecoverable errors (i.e. errors where no matter how many times the call is retried, it will never succeed). If we run into one, we set the request to finished and return an appropriate response. This might occur if the credit card was invalid or the transaction was otherwise declined by the payment gateway.
Now that our charge has been persisted, the next step is to send a receipt to the user. Making an external mail call would normally require its own foreign state mutation, but because we’re using a transactionally-staged job drain, we get a guarantee that the operation commits along with the rest of the transaction.
atomic_phase(key) do
StagedJob.insert(
job_name: "send_ride_receipt",
job_args: Sequel.pg_jsonb({
amount: 20_00,
currency: "usd",
user_id: user.id
})
)
Response.new(201, wrap_ok(Messages.ok))
end
The final step is to set a response telling the user that everything worked as expected. We’re done!
Besides the web process running the API, a few other processes are needed to make everything work (see Atomic Rocket Rides’ Procfile for the full list and the corresponding implementations in the same repository).

There should be an enqueuer that moves jobs from staged_jobs to the job queue after their inserting transaction has committed. See this article for details on how to build one, or the implementation from Atomic Rocket Rides.
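The enqueuer's core loop is small: select a batch of staged jobs, send them to the real queue, and delete them from staging so each is enqueued at most once. In the real implementation the select and delete happen inside one database transaction; this sketch substitutes plain in-memory arrays for the staged_jobs relation and the queue to stay self-contained:

```ruby
# Drain staged jobs into a destination queue in batches. `staged_jobs` and
# `queue` stand in for the Postgres relation and the real job queue; jobs
# are removed from staging as they're handed off.
def enqueue_staged_jobs(staged_jobs, queue, batch_size: 100)
  until staged_jobs.empty?
    staged_jobs.shift(batch_size).each do |job|
      queue << { name: job[:job_name], args: job[:job_args] }
    end
  end
  queue
end
```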
One problem with this implementation is that we’re reliant on clients to push indeterminate requests (for example, one that appeared to time out) to completion. Usually clients are willing to do this because they want to see their requests go through, but there can be cases where a client starts working, never quite finishes, and drops off forever.
A stretch goal is to implement a completer. Its only job is to find requests that look like they never finished to satisfaction and which clients appear to have dropped, and push them through to completion.
It doesn’t even have to have special knowledge about how the stack is implemented. It just needs to know how to read idempotency keys and have a specialized internal authentication path that allows it to retry anyone’s request.
See the Atomic Rocket Rides repository for a completer implementation.
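The selection criteria are simple enough to write down. Here's a sketch of the predicate over in-memory key records; the field names follow the idempotency_keys schema above, while the one-hour staleness horizon is an assumption of ours rather than anything from the reference implementation:

```ruby
COMPLETER_STALENESS = 60 * 60 # assumed horizon: last run over an hour ago

# Find idempotency keys that look abandoned: not finished, not locked by a
# live request, and not attempted recently enough for the client to still
# plausibly be retrying on its own.
def keys_needing_completion(keys, now: Time.now)
  keys.select do |key|
    key[:recovery_point] != "finished" &&
      key[:locked_at].nil? &&
      key[:last_run_at] < now - COMPLETER_STALENESS
  end
end
```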
Idempotency keys are meant to act as a mechanism for guaranteeing idempotence, and not as a permanent archive of historical requests. After some amount of time a reaper process should go through keys and delete them.
I’d suggest a threshold of about 72 hours so that even if a bug is deployed on Friday that errors a large number of valid requests, an app could still keep a record of them throughout the weekend and onto Monday where a developer would have a chance to commit a fix and have the completer push them through to success.
An ideal reaper might even notice requests that could not be finished successfully and try to do some cleanup on them. If cleanup is difficult or impossible, it should put them in a list somewhere so that a human can find out what failed.
See the Atomic Rocket Rides repository for a reaper implementation.
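The reaper's selection can be sketched the same way: anything older than the horizon is a candidate, but keys that never reached finished are set aside for review instead of being deleted silently. This again works over in-memory records for the sake of a self-contained example, with the 72-hour horizon matching the threshold suggested above:

```ruby
REAP_HORIZON = 72 * 60 * 60 # 72 hours, per the threshold suggested above

# Delete old finished keys in place and return the unfinished ones so a
# human can look at what failed before they're removed.
def reap_idempotency_keys(keys, now: Time.now)
  old = keys.select { |k| k[:created_at] < now - REAP_HORIZON }
  needs_review, deletable = old.partition { |k| k[:recovery_point] != "finished" }
  keys.reject! { |k| deletable.include?(k) }
  needs_review
end
```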
Now that we have all the pieces in place, let’s assume the truth of Murphy’s Law and imagine some scenarios that could go wrong while a client app is talking to the new Atomic Rocket Rides backend:
The client makes a request, but the connection breaks before it reaches the backend: The client, having used an idempotency key, knows that retries are safe and so retries. The next attempt succeeds.
Two requests try to create an idempotency key at the same time: A UNIQUE constraint in the database guarantees that only one request can succeed. One goes through, and the other gets a 409 Conflict.
An idempotency key is created, but the database goes down and it fails soon after: The client continues to retry against the API until it comes back online. Once it does, the created key is recovered and the request is continued.
Stripe is down: The atomic phase containing the Stripe request fails, and the API responds with an error that tells the client to retry. They continue to do so until Stripe comes back online and the charge succeeds.
A server process dies while waiting for a response from Stripe: Luckily, the call to Stripe was also made with its own idempotency key. The client retries and a new call to Stripe is invoked with the same key. Stripe’s own idempotency guarantees ensure that we haven’t double-charged our user.
A bad deploy 500s all requests midway through: Developers scramble and deploy a fix for the bug. After it’s out, clients retry and the original requests succeed along the newly bug-free path. If the fix took so long to get out that clients have long since gone away, then the completer process pushes them through.
Our care around implementing a failsafe design has paid off – the system is safe despite a wide variety of possible failures.
If we know that a foreign state mutation is an idempotent operation or it supports an idempotency key (like Stripe does), we know that it’s safe to retry any failures that we see.
Unfortunately, not every service will make this guarantee. If we try to make a non-idempotent foreign state mutation and we see a failure, we may have to persist this operation as permanently errored. In many cases we won’t know whether it’s safe to retry or not, and we’ll have to take the conservative route and fail the operation.
The exception is if we got an error back from the non-idempotent API that tells us explicitly that it’s okay to retry. Indeterminate errors like a connection reset or timeout will have to be marked as failed.
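That decision procedure is small enough to write down. A sketch, with parameter names of our own choosing:

```ruby
# Decide whether a failed foreign state mutation may be retried. Any
# failure is retryable against an idempotent operation; against a
# non-idempotent one, only an explicit go-ahead from the remote API is,
# and indeterminate failures like timeouts or connection resets must be
# marked as permanently failed.
def safe_to_retry?(operation_idempotent:, api_says_retryable: false)
  operation_idempotent || api_says_retryable
end
```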
This is why you should implement idempotency and/or idempotency keys on all your services!
It’s worth mentioning that none of this is possible on a non-ACID store like MongoDB. Without transactional semantics a database can’t ever guarantee that any two operations commit atomically – every operation against your database becomes equivalent to a foreign state mutation because the notion of an atomic phase is impossible.
This article focuses heavily on APIs, but note that this same technique is reusable for other software as well. A common problem in web apps is double form submission. A user clicking the “Submit” button twice in quick succession may initiate two separate HTTP calls, and in cases where submissions have non-idempotent side effects (e.g. charging the user) this is a problem.
When rendering the form initially, we can add an <input type="hidden"> to it that contains an idempotency key. This value will stay the same across multiple submissions, and the server can use it to dedup the request.
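Rendered server-side, that might look like the following sketch; the helper name and field name are ours:

```ruby
require "securerandom"

# Render a hidden input carrying a per-form idempotency key. The key is
# generated when the form is rendered, so a double submission sends the
# same value twice and the server can dedup it.
def idempotency_key_input
  %(<input type="hidden" name="idempotency_key" value="#{SecureRandom.uuid}">)
end
```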
API backends should aim to be passively safe – no matter what kind of failures are thrown at them they’ll end up in a stable state, and users are never left broken even in the most extreme cases. From there, active mechanisms can drive the system towards perfect cohesion. Ideally, human operators never have to intervene to fix things (or at least as infrequently as possible).
Purely idempotent transactions and the idempotency keys with atomic phases described here are two ways to move in that direction. Failures are not only understood to be possible, but are expected, and enough thought has been applied to the system’s design that we know it’ll tolerate failure cleanly no matter what happens.
1 One caveat: it may be possible to implement two-phase commit between a system and all the other systems where it performs foreign state mutations. This would allow distributed rollbacks, but it’s complex and time-consuming enough to implement that it’s rarely seen in real software environments.
Did I make a mistake? Please consider sending a pull request.