Atomicity (in the sense of “ACID”) states that for a series of operations performed against a database, either every one of them commits together, or they’re all rolled back; no in between states are allowed. For code that needs to be resilient to the messiness of the real world, it’s a godsend.
Instead of bugs that make it to production changing data and then leaving it permanently corrupt, those changes are reverted. The long tail of connections that are dropped midway from intermittent problems and other unexpected states while handling millions of requests might cause inconvenience, but won’t scramble your data.
Postgres’s implementation in particular is known to provide powerful transaction semantics with little overhead. And while I’ve used it for years, it’s never been something that I’ve understood. Postgres works reliably enough that I’ve been able to treat it as a black box – wonderfully useful, but with inner workings that are a mystery.
This article looks into how Postgres keeps the books on its transactions, how they’re committed atomically, and some concepts that are key to understanding it all 1.
Say you build a simple database that reads and writes from an on-disk CSV file. When a single client comes in with a request, it opens the file, reads some information, and writes the changes back. Things are mostly working fine, but then one day you decide to enhance your database with a sophisticated new feature, multi-client support!
Unfortunately, the new implementation is immediately plagued by problems that seem to especially apparent when two clients are trying to access data around the same time. One opens the CSV file, reads, modifies, and writes some data, but that change is immediately clobbered by another client trying to do the same.
This is a problem of concurrent access and it’s addressed by introducing concurrency control. There are plenty of naive solutions. We could ensure that any process takes out an exclusive lock on a file before reading or writing it, or we could push all operations through a single flow control point so that they only run one at a time. Not only are these workarounds slow, but they won’t scale up to allow us to make our database fully ACID-compliant. Modern databases have a better way, MVCC (multi-version concurrency control).
Under MVCC, statements execute inside of a transaction, and instead of overwriting data directly, they create new versions of it. The original data is still available to other clients that might need it, and any new data stays hidden until the transaction commits. Clients are no longer in direct contention, and data stays safely persisted because they’re not overwriting each other’s changes.
When a transaction starts, it takes a snapshot that captures the state of a database at that moment in time. Every transaction in the database is applied in serial order, with a global lock ensuring that only one is being confirmed committed or aborted at a time. A snapshot is a perfect representation of the database’s state between two transactions.
To avoid the neverending accumulation of rows that have been deleted and hidden, databases will eventually remove obsolete data by way of a vacuum process (or in some cases, opportunistic “microvacuums” that happen in band with other queries), but they’ll only do so for information that’s no longer needed by open snapshots.
Postgres manages concurrent access with MVCC. Lets take a look at how it works.
Here’s the data structure that Postgres uses to represent a transaction (from proc.c):
typedef struct PGXACT
{
TransactionId xid; /* id of top-level transaction currently being
* executed by this proc, if running and XID
* is assigned; else InvalidTransactionId */
TransactionId xmin; /* minimal running XID as it was when we were
* starting our xact, excluding LAZY VACUUM:
* vacuum must not remove tuples deleted by
* xid >= xmin ! */
...
} PGXACT;
Transactions are identified with a xid
(transaction, or
“xact” ID). As an optimization, Postgres will only assign a
transaction a xid
if it starts to modify data because
it’s only at that point where other processes need to start
tracking its changes. Readonly transactions can execute
happily without ever needing a xid
.
xmin
is always set immediately to the smallest xid
of
any transactions that are still running when this one
starts. Vacuum processes calculate the minimum boundary of
data that they need to keep by taking the minimum of the
xmin
s of all active transactions
Rows of data in Postgres are often referred to as
tuples. While Postgres uses common lookup structures
like B-trees to make retrievals fast, indexes don’t store a
tuple’s full set of data or any of its visibility
information. Instead, they store a tid
(tuple ID) that
can be used to retrieve a row from physical storage,
otherwise known as “the heap”. The tid
gives Postgres a
starting point where it can start scanning the heap until
it finds a tuple that satisfies the current snapshot’s
visibility.
Here’s the Postgres implementation for a heap tuple (as
opposed to an index tuple which is the structure found in
an index), along with a few other structs that represent
its header information (from htup.h
and
htup_details.h
):
typedef struct HeapTupleData
{
uint32 t_len; /* length of *t_data */
ItemPointerData t_self; /* SelfItemPointer */
Oid t_tableOid; /* table the tuple came from */
HeapTupleHeader t_data; /* -> tuple header and data */
} HeapTupleData;
/* referenced by HeapTupleData */
struct HeapTupleHeaderData
{
HeapTupleFields t_heap;
...
}
/* referenced by HeapTupleHeaderData */
typedef struct HeapTupleFields
{
TransactionId t_xmin; /* inserting xact ID */
TransactionId t_xmax; /* deleting or locking xact ID */
...
} HeapTupleFields;
Like a transaction, a tuple tracks its own xmin
, except
in the tuple’s case it’s recorded to represent the first
transaction where the tuple becomes visible (i.e. the one
that created it). It also tracks xmax
to be the last
transaction where the tuple is visible (i.e. the one that
deleted it) 2.
xmin
and xmax
are internal concepts, but they can be
revealed as hidden columns on any Postgres table. Just
select them explicitly by name:
# SELECT *, xmin, xmax FROM names;
id | name | xmin | xmax
----+----------+-------+-------
1 | Hyperion | 27926 | 27928
2 | Endymion | 27927 | 0
Here’s the snapshot structure (from snapshot.h):
typedef struct SnapshotData
{
/*
* The remaining fields are used only for MVCC snapshots, and are normally
* just zeroes in special snapshots. (But xmin and xmax are used
* specially by HeapTupleSatisfiesDirty.)
*
* An MVCC snapshot can never see the effects of XIDs >= xmax. It can see
* the effects of all older XIDs except those listed in the snapshot. xmin
* is stored as an optimization to avoid needing to search the XID arrays
* for most tuples.
*/
TransactionId xmin; /* all XID < xmin are visible to me */
TransactionId xmax; /* all XID >= xmax are invisible to me */
/*
* For normal MVCC snapshot this contains the all xact IDs that are in
* progress, unless the snapshot was taken during recovery in which case
* it's empty. For historic MVCC snapshots, the meaning is inverted, i.e.
* it contains *committed* transactions between xmin and xmax.
*
* note: all ids in xip[] satisfy xmin <= xip[i] < xmax
*/
TransactionId *xip;
uint32 xcnt; /* # of xact ids in xip[] */
...
}
A snapshot’s xmin
is calculated the same way as a
transaction’s (i.e. the lowest xid
amongst running
transactions when the snapshot is created), but for a
different purpose. This xmin
is a lower boundary for data
visibility. Tuples created by a transaction with xid <
xmin
are visible to the snapshot.
It also defines an xmax
, which is set to the last
committed xid
plus one. xmax
tracks the upper bound of
visibility; transactions with xid >= xmax
are invisible
to the snapshot.
Lastly, a snapshot defines *xip
, an array of all of the
xid
s of transactions that were in progress when the
snapshot was created. *xip
is needed because even though
there’s already a visibility boundary with xmin
, there
may still be some transactions that are already committed
with xid
s greater than xmin
, but also greater than a
xid
of an in-progress transaction (so they couldn’t be
included in xmin
).
We want the results any committed transactions with xid >
xmin
to be visible, but the results of any that were in
flight hidden. *xip
stores the list of transactions that
were active when the snapshot was created so that we can
tell which is which.
When you execute a BEGIN
, Postgres puts some basic
bookkeeping in place, but it will defer more expensive
operations as long as it can. For example, the new
transaction isn’t assigned a xid
until it starts
modifying data to reduce the expense of tracking it
elsewhere in the system.
The new transaction also won’t immediately get a snapshot.
It will when it runs its first query, whereupon
exec_simple_query
(in postgres.c
)
will push one onto a stack. Even a simple SELECT 1;
is
enough to trigger it:
static void
exec_simple_query(const char *query_string)
{
...
/*
* Set up a snapshot if parse analysis/planning will need one.
*/
if (analyze_requires_snapshot(parsetree))
{
PushActiveSnapshot(GetTransactionSnapshot());
snapshot_set = true;
}
...
}
Creating the new snapshot is where the machinery really
starts coming to life. Here’s GetSnapshotData
(in
procarray.c
):
Snapshot
GetSnapshotData(Snapshot snapshot)
{
/* xmax is always latestCompletedXid + 1 */
xmax = ShmemVariableCache->latestCompletedXid;
Assert(TransactionIdIsNormal(xmax));
TransactionIdAdvance(xmax);
...
snapshot->xmax = xmax;
}
This function does a lot of initialization, but like we
talked about, some of its most important work is set to the
snapshot’s xmin
, xmax
, and *xip
. The easiest of these
is xmax
, which is retrieved from shared memory managed by
the postmaster. Every transaction that commits notifies the
postmaster that it did, and latestCompletedXid
will be
updated if the xid
is higher than what it already holds.
(more on this later).
Notice that it’s the function’s responsibility to add one
to the last xid
. This isn’t quite as trivial as
incrementing it because transaction IDs in Postgres are
allowed to wrap. A transaction ID is defined as a simple
unsigned 32-bit integer (from c.h):
typedef uint32 TransactionId;
Even though xid
s are assigned only opportunistically (as
mentioned above, reads don’t need one), a system doing a
lot of throughput can easily hit the bounds of 32 bits, so
the system needs to be able to wrap to “reset” the xid
sequence as necessary. This is handled by some preprocessor
magic (in transam.h):
#define InvalidTransactionId ((TransactionId) 0)
#define BootstrapTransactionId ((TransactionId) 1)
#define FrozenTransactionId ((TransactionId) 2)
#define FirstNormalTransactionId ((TransactionId) 3)
...
/* advance a transaction ID variable, handling wraparound correctly */
#define TransactionIdAdvance(dest) \
do { \
(dest)++; \
if ((dest) < FirstNormalTransactionId) \
(dest) = FirstNormalTransactionId; \
} while(0)
The first few IDs are reserved as special identifiers, so
we always skip those and start at 3
.
Back in GetSnapshotData
, we get xmin
and xip
by
iterating over all running transactions (again, see
Snapshots above for an explanation of what
these do):
/*
* Spin over procArray checking xid, xmin, and subxids. The goal is
* to gather all active xids, find the lowest xmin, and try to record
* subxids.
*/
for (index = 0; index < numProcs; index++)
{
volatile PGXACT *pgxact = &allPgXact[pgprocno];
TransactionId xid;
xid = pgxact->xmin; /* fetch just once */
/*
* If the transaction has no XID assigned, we can skip it; it
* won't have sub-XIDs either. If the XID is >= xmax, we can also
* skip it; such transactions will be treated as running anyway
* (and any sub-XIDs will also be >= xmax).
*/
if (!TransactionIdIsNormal(xid)
|| !NormalTransactionIdPrecedes(xid, xmax))
continue;
if (NormalTransactionIdPrecedes(xid, xmin))
xmin = xid;
/* Add XID to snapshot. */
snapshot->xip[count++] = xid;
...
}
...
snapshot->xmin = xmin;
Transactions are committed through CommitTransaction
(in
xact.c
). This function is monstrously complex,
but here are a few of its important parts:
static void
CommitTransaction(void)
{
...
/*
* We need to mark our XIDs as committed in pg_xact. This is where we
* durably commit.
*/
latestXid = RecordTransactionCommit();
/*
* Let others know about no transaction in progress by me. Note that this
* must be done _before_ releasing locks we hold and _after_
* RecordTransactionCommit.
*/
ProcArrayEndTransaction(MyProc, latestXid);
...
}
Postgres is entirely designed around the idea of durability, which dictates that even in extreme events like a crash or power loss, committed transactions should stay committed. Like many good systems, it uses a write-ahead log (WAL, or “xlog”) to achieve this durability. All changes are written and flushed to disk, and even in the event of a sudden termination, Postgres can replay what it finds in the WAL to recover any changes that didn’t make it into its data files.
RecordTransactionCommit
from the snippet above handles
getting a change in transaction state to the WAL:
static TransactionId
RecordTransactionCommit(void)
{
bool markXidCommitted = TransactionIdIsValid(xid);
/*
* If we haven't been assigned an XID yet, we neither can, nor do we want
* to write a COMMIT record.
*/
if (!markXidCommitted)
{
...
} else {
XactLogCommitRecord(xactStopTimestamp,
nchildren, children, nrels, rels,
nmsgs, invalMessages,
RelcacheInitFileInval, forceSyncCommit,
MyXactFlags,
InvalidTransactionId /* plain commit */ );
....
}
if ((wrote_xlog && markXidCommitted &&
synchronous_commit > SYNCHRONOUS_COMMIT_OFF) ||
forceSyncCommit || nrels > 0)
{
XLogFlush(XactLastRecEnd);
/*
* Now we may update the CLOG, if we wrote a COMMIT record above
*/
if (markXidCommitted)
TransactionIdCommitTree(xid, nchildren, children);
}
...
}
Along with the WAL, Postgres also has a commit log (or
“clog” or “pg_xact”) which summarizes every transaction and
whether it committed or aborted. This is what
TransactionIdCommitTree
is doing above – the bulk of the
information is written out to WAL first, then
TransactionIdCommitTree
goes through and sets the
transaction’s status in the commit log to “committed”.
Although the commit log is called a “log”, it’s really more of a bitmap of commit statuses split across a number of pages in shared memory and on disk. In an example of the kind of frugality rarely seen in modern programming, the status of a transaction can be recorded in only two bits, so we can store four transactions per byte, or 32,768 in a standard 8k page.
#define TRANSACTION_STATUS_IN_PROGRESS 0x00
#define TRANSACTION_STATUS_COMMITTED 0x01
#define TRANSACTION_STATUS_ABORTED 0x02
#define TRANSACTION_STATUS_SUB_COMMITTED 0x03
#define CLOG_BITS_PER_XACT 2
#define CLOG_XACTS_PER_BYTE 4
#define CLOG_XACTS_PER_PAGE (BLCKSZ * CLOG_XACTS_PER_BYTE)
While durability is important, performance is also a value
that’s core to the Postgres philosophy. If a transaction
was never assigned a xid
, Postgres skips writing it to
the WAL and commit log. If a transaction was aborted, we
still write its aborted status to the WAL and commit log,
but don’t bother to immediately flush (fsync) because even
in the event of a crash, we wouldn’t lose any information.
During crash recovery, Postgres would notice the unflagged
transactions, and assume that they were aborted.
TransactionIdCommitTree
(in transam.c, and
its implementation TransactionIdSetTreeStatus
in
clog.c) commits a “tree” because a commit
may have subcommits. I won’t go into subcommits in any
detail, but it’s worth noting that because
TransactionIdCommitTree
cannot be guaranteed to be
atomic, each subcommit is recorded as committed separately,
and the parent is recorded as a final step. When Postgres
is recovering after a crash, subcommit records aren’t
considered to be committed (even if they’re marked as such)
until the parent record is read and confirmed committed.
Once again this is in the name of atomicity; the system could have successfully recorded every subcommit, but then crashed before it could write the parent.
Here’s what that looks like in clog.c
:
/*
* Record the final state of transaction entries in the commit log for
* all entries on a single page. Atomic only on this page.
*
* Otherwise API is same as TransactionIdSetTreeStatus()
*/
static void
TransactionIdSetPageStatus(TransactionId xid, int nsubxids,
TransactionId *subxids, XidStatus status,
XLogRecPtr lsn, int pageno)
{
...
LWLockAcquire(CLogControlLock, LW_EXCLUSIVE);
/*
* Set the main transaction id, if any.
*
* If we update more than one xid on this page while it is being written
* out, we might find that some of the bits go to disk and others don't.
* If we are updating commits on the page with the top-level xid that
* could break atomicity, so we subcommit the subxids first before we mark
* the top-level commit.
*/
if (TransactionIdIsValid(xid))
{
/* Subtransactions first, if needed ... */
if (status == TRANSACTION_STATUS_COMMITTED)
{
for (i = 0; i < nsubxids; i++)
{
Assert(ClogCtl->shared->page_number[slotno] == TransactionIdToPage(subxids[i]));
TransactionIdSetStatusBit(subxids[i],
TRANSACTION_STATUS_SUB_COMMITTED,
lsn, slotno);
}
}
/* ... then the main transaction */
TransactionIdSetStatusBit(xid, status, lsn, slotno);
}
...
LWLockRelease(CLogControlLock);
}
With the transaction recorded to commit log, it’s safe to
signal its completion to the rest of the system. This
happens in the second call in CommitTransaction
above
(into procarray.c):
void
ProcArrayEndTransaction(PGPROC *proc, TransactionId latestXid)
{
/*
* We must lock ProcArrayLock while clearing our advertised XID, so
* that we do not exit the set of "running" transactions while someone
* else is taking a snapshot. See discussion in
* src/backend/access/transam/README.
*/
if (LWLockConditionalAcquire(ProcArrayLock, LW_EXCLUSIVE))
{
ProcArrayEndTransactionInternal(proc, pgxact, latestXid);
LWLockRelease(ProcArrayLock);
}
...
}
static inline void
ProcArrayEndTransactionInternal(PGPROC *proc, PGXACT *pgxact,
TransactionId latestXid)
{
...
/* Also advance global latestCompletedXid while holding the lock */
if (TransactionIdPrecedes(ShmemVariableCache->latestCompletedXid,
latestXid))
ShmemVariableCache->latestCompletedXid = latestXid;
}
You may be wondering what a “proc array” is. Unlike many
other daemon-like services, Postgres uses a process
forking model to handle concurrency instead of threading.
When it accepts a new connection, the Postmaster forks a
new backend (in postmaster.c
). Backends
are represented by the PGPROC
structure (in
proc.h
), and the entire set of active processes
is tracked in shared memory, thus “proc array”.
Now remember how when we created a snapshot we set its
xmax
to latestCompletedXid + 1
? By setting
latestCompletedXid
in global shared memory to the xid
of the transaction that just committed, we’ve just made its
results visible to every new snapshot that starts from this
point forward across any backend.
Take a look at the lock acquisition and release calls on
the lines with LWLockConditionalAcquire
and
LWLockRelease
. Most of the time, Postgres is perfectly
happy to let processes do work in parallel, but there are a
few places where locks need to be acquired to avoid
contention, and this is one of them. Near the beginning of
this article we touched on how transactions in Postgres
commit or abort in serial order, one at a time.
ProcArrayEndTransaction
acquires an exclusive lock so
that it can update latestCompletedXid
without having its
work negated by another process.
Throughout this entire process, a client has been waiting synchronously for its transaction to be confirmed. Part of the atomicity guarantee is that false positives where the databases signals a transaction as committed when it hasn’t been aren’t possible. Failures can happen in many places, but if there is one, the client finds out about it and has a chance to retry or otherwise address the problem.
We covered earlier how visibility information is stored on
heap tuples. heapgettup
(in heapam.c) is the
method responsible for scanning the heap for tuples that
meet a snapshot’s visibility criteria:
static void
heapgettup(HeapScanDesc scan,
ScanDirection dir,
int nkeys,
ScanKey key)
{
...
/*
* advance the scan until we find a qualifying tuple or run out of stuff
* to scan
*/
lpp = PageGetItemId(dp, lineoff);
for (;;)
{
/*
* if current tuple qualifies, return it.
*/
valid = HeapTupleSatisfiesVisibility(tuple,
snapshot,
scan->rs_cbuf);
if (valid)
{
return;
}
++lpp; /* move forward in this page's ItemId array */
++lineoff;
}
...
}
HeapTupleSatisfiesVisibility
is a preprocessor macro that
will call into “satisfies” function like
HeapTupleSatisfiesMVCC
(in tqual.c
):
bool
HeapTupleSatisfiesMVCC(HeapTuple htup, Snapshot snapshot,
Buffer buffer)
{
...
else if (XidInMVCCSnapshot(HeapTupleHeaderGetRawXmin(tuple), snapshot))
return false;
else if (TransactionIdDidCommit(HeapTupleHeaderGetRawXmin(tuple)))
SetHintBits(tuple, buffer, HEAP_XMIN_COMMITTED,
HeapTupleHeaderGetRawXmin(tuple));
...
/* xmax transaction committed */
return false;
}
XidInMVCCSnapshot
does an initial check to see whether
the tuple’s xid
is visible according to the snapshot’s
xmin
, xmax
, and xip
. Here’s a simplified
implementation that shows the checks on each (from
tqual.c
):
static bool
XidInMVCCSnapshot(TransactionId xid, Snapshot snapshot)
{
/* Any xid < xmin is not in-progress */
if (TransactionIdPrecedes(xid, snapshot->xmin))
return false;
/* Any xid >= xmax is in-progress */
if (TransactionIdFollowsOrEquals(xid, snapshot->xmax))
return true;
...
for (i = 0; i < snapshot->xcnt; i++)
{
if (TransactionIdEquals(xid, snapshot->xip[i]))
return true;
}
...
}
Note the function’s return value is inverted compared to
how you’d think about it intuitively – a false
means
that the xid
is visible to the snapshot. Although
confusing, you can follow what it’s doing by comparing the
return values to where it’s invoked.
After confirming that the xid
is visible, Postgres checks
its commit status with TransactionIdDidCommit
(from
transam.c
):
bool /* true if given transaction committed */
TransactionIdDidCommit(TransactionId transactionId)
{
XidStatus xidstatus;
xidstatus = TransactionLogFetch(transactionId);
/*
* If it's marked committed, it's committed.
*/
if (xidstatus == TRANSACTION_STATUS_COMMITTED)
return true;
...
}
Further exploring the implementation of
TransactionLogFetch
will reveal that it works as
advertised. It calculates a location in the commit log from
the given transaction ID and reaches into it to get that
transaction’s commit status. Whether or not the transaction
committed is used to help determine the tuple’s visibility.
The key here is that for purposes of consistency, the commit log is considered the canonical source for commit status (and by extension, visibility) 3. The same information will be returned regardless of whether Postgres successfully committed a transaction hours ago, or seconds before a crash that the server is just now recovering from.
HeapTupleSatisfiesMVCC
from above does one more thing
before returning from a visibility check:
SetHintBits(tuple, buffer, HEAP_XMIN_COMMITTED,
HeapTupleHeaderGetRawXmin(tuple));
Checking the commit log to see whether a tuple’s xmin
or
xmax
transactions are committed is an expensive
operation. To avoid having to go to it every time, Postgres
will set special commit status flags called “hint bits” for
a heap tuple that is scanned. Subsequent operations can
check the tuple’s hint bits and are saved a trip to the
commit log themselves.
When I run a transaction against a database:
BEGIN;
SELECT * FROM users WHERE email = 'brandur@example.com';
INSERT INTO users (email) VALUES ('brandur@example.com')
RETURNING *;
COMMIT;
I don’t stop to think about what’s going on. I’m given a powerful high level abstraction (in the form of SQL) which I know will work reliably, and as we’ve seen, Postgres does all the heavy lifting under the covers. Good software is a black box, and Postgres is an especially dark one (although with pleasantly accessible internals).
Thank you to Peter Geoghegan for patiently answering all my amateur questions about Postgres transactions and snapshots, and giving me some pointers for finding relevant code.
1 A few words of warning: the Postgres source code is pretty overwhelming, so I’ve glossed over a few details to make this reading more digestible. It’s also under active development, so the passage of time will likely render some of these code samples quite obsolete.
2 Readers may notice that while xmin
and xmax
are
fine for tracking a tuple’s creation and deletion, they
aren’t to enough to handle updates. For brevity’s sake, I’m
glossing over how updates work for now.
3 Note that the commit log will eventually be truncated,
but only beyond the a snapshot’s xmin
horizon, and
therefore for the visibility check short circuits before
having to make a check in WAL.
Did I make a mistake? Please consider sending a pull request.