brandur.org

The Missing Manual for Hacking Postgres

It’s probably obvious that Postgres is my favorite database. One minor grievance that I have with the project is that its documentation is almost entirely optimized for people who ultimately are users of the database rather than developers of it. An unfortunate side effect of this is that none of the repository’s standard files (e.g. README) give much insight into how to get started with the source code.

In several places, files in the repository and errors generated by make tasks are actively misleading in that they reference an INSTALL file for further instructions. Some investigation reveals that INSTALL doesn’t actually exist on master; it’s only generated as part of a release.

The excellent Postgres docs of course contain all the information needed to get started with development, but if they have one weakness, it’s that their overwhelming verbosity tends to obscure information.

Here I’ve tried to assemble some instructions for getting started that are useful and, more importantly, succinct. I don’t expect most of them to change all that much, but I’ll try to keep the document up-to-date in case they do.

It’s often desirable to have a stable release of Postgres running on your machine for day-to-day work along with your experimental build, so you may want to choose a non-standard install directory, data directory, and port for development.

A prefix is passed during configure to specify the target install directory. I use ./build in the current directory and name it $PG_BUILD_DIR. I call my data directory $PG_DATA_DIR.

A port can be overridden with a command line argument to a server or client command like psql. It can also be overridden for an entire session by setting the PGPORT environment variable. I’ve chosen 5433 as my port.

I use the excellent direnv to manage these variables. It reads them out of an .envrc in the source directory:

export PG_BUILD_DIR="$PWD/build"
export PG_DATA_DIR="$PWD/data/primary"
export PGPORT=5433

(Be sure to direnv allow after saving the file.)
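Since every later command depends on these three variables, it can help to confirm they are actually set before building. Here is a minimal sketch; check_pg_env is a hypothetical helper name, not something that ships with Postgres or direnv:

```shell
# Hypothetical helper: verify the development variables are set before building.
check_pg_env() {
    missing=0
    for name in PG_BUILD_DIR PG_DATA_DIR PGPORT; do
        # Indirect lookup of the variable whose name is stored in $name.
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "missing: $name" >&2
            missing=1
        fi
    done
    # Returns 0 when everything is set, 1 otherwise.
    return "$missing"
}
```

Running check_pg_env in a fresh shell is a quick way to notice that direnv hasn't loaded the .envrc yet.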

Clone the repository:

git clone https://github.com/postgres/postgres.git

Run configure with a prefix pointing to your chosen target build directory. Also, to save you some time later, we’ll pass a few other useful options that will enable us to debug with tools like gdb:

./configure --enable-cassert --enable-debug --prefix $PG_BUILD_DIR CFLAGS="-ggdb -Og -g3 -fno-omit-frame-pointer"

Then build it. The -j option gives you some parallelism which will speed things up for any computer that’s still running today.

make -j16 -s

The options passed are:

  • -j: Build in parallel. Pick a number based on the number of cores your computer has. I’m using an iMac Pro with 8 cores, each of which is hyper-threaded, so I specify a parallelism of 16.

  • -s: Build quietly. Normally build commands produce a lot of output which can obscure warnings emitted higher up in the trace. Using -s prevents this and produces cleaner output.
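If you'd rather not hardcode the -j value, the core count can be looked up at build time. This is a rough sketch that assumes getconf or sysctl is available; pg_jobs is a made-up helper name:

```shell
# Hypothetical helper: print the number of online logical CPUs, or 1 if unknown.
pg_jobs() {
    n=$(getconf _NPROCESSORS_ONLN 2>/dev/null) \
        || n=$(sysctl -n hw.ncpu 2>/dev/null) \
        || n=1
    # Guard against empty or non-numeric output.
    case "$n" in
        ''|*[!0-9]*) n=1 ;;
    esac
    [ "$n" -ge 1 ] || n=1
    echo "$n"
}

# Usage:
#   make -j"$(pg_jobs)" -s
```

On a hyper-threaded machine this reports logical cores, which matches the 16 used above for an 8-core iMac Pro.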

Install the result to the prefix configured above:

make install -j16 -s

Initialize a data directory and start an instance of Postgres right in your terminal. This is convenient because you can see any logging that it emits, and you can stop it with Ctrl+C and restart it easily.

mkdir -p $PG_DATA_DIR

# initialize a data directory
$PG_BUILD_DIR/bin/initdb -D $PG_DATA_DIR

# start the server
$PG_BUILD_DIR/bin/postgres -D $PG_DATA_DIR -p $PGPORT

Now create a database and connect to it:

$PG_BUILD_DIR/bin/createdb -p $PGPORT brandur-test
$PG_BUILD_DIR/bin/psql -p $PGPORT brandur-test
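If you script these steps, createdb can race a server that is still starting up. One possible approach is to poll with pg_isready, which is installed alongside the other binaries. The helper name wait_for_pg and the PG_ISREADY override are assumptions made for this sketch:

```shell
# Hypothetical helper: poll until the server accepts connections or we give up.
# PG_ISREADY may be overridden; it defaults to the pg_isready in the build dir.
wait_for_pg() {
    tries=${1:-30}
    ready_cmd=${PG_ISREADY:-$PG_BUILD_DIR/bin/pg_isready}
    while [ "$tries" -gt 0 ]; do
        # -q suppresses output; the exit status is 0 once connections are accepted.
        if "$ready_cmd" -q -p "${PGPORT:-5432}"; then
            return 0
        fi
        tries=$((tries - 1))
        sleep 1
    done
    return 1
}

# Usage:
#   wait_for_pg 10 && $PG_BUILD_DIR/bin/createdb -p $PGPORT brandur-test
```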

Postgres doesn’t have much in the way of standard unit testing, but instead relies heavily on a thorough regression suite. Run it with:

make check

The command will start a new server, set it up, run the suite, and then tear it down. This is a reliable way to get consistent results, but is somewhat slow. A faster version is also provided which can use a server that you already have running elsewhere:

# requires $PGPORT to be set in the environment
make installcheck

There’s also a parallel version available to further improve speed (you should basically always prefer this variant):

# requires $PGPORT to be set in the environment
make installcheck-parallel
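When a regression test fails, pg_regress leaves the differences in src/test/regress/regression.diffs. A small wrapper can print that file right after a failed run. This is only a sketch, and show_regress_diffs is a hypothetical name:

```shell
# Hypothetical helper: print the regression diffs left by a failed run, if any.
# Takes the source root as an optional argument (defaults to the current dir).
show_regress_diffs() {
    diffs="${1:-.}/src/test/regress/regression.diffs"
    if [ -s "$diffs" ]; then
        cat "$diffs"
    else
        echo "no regression diffs found"
    fi
}

# Usage:
#   make installcheck-parallel || show_regress_diffs
```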

Building and testing Postgres is already pretty fast (with parallel commands, make for me takes ~30s from scratch and running the test suite takes ~15s), but if you’re going to be working with it heavily, you might want to take a few steps to make it even faster.

ccache is a clever little program that pretends to be your compiler target and caches results so that they can be returned immediately the next time it’s run with the same inputs.

It’s trivial to install (on Mac OS, I use a simple brew install ccache) and causes very few problems, so it’s a pretty easy enhancement.

Use it by telling configure that you want ccache as your C compiler:

./configure --enable-cassert --enable-debug --prefix $PG_BUILD_DIR --with-CC="ccache gcc" CFLAGS="-ggdb -Og -g3 -fno-omit-frame-pointer"

After warming up ccache by building once, then doing a make clean -j16 -s and building again, my runtime drops from 30s to less than 5s. Incremental compiles are even faster.
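If the same setup has to work on machines where ccache may not be installed, the compiler value can be chosen at configure time. The sketch below passes it through the standard CC variable rather than --with-CC; pg_cc is a made-up helper name:

```shell
# Hypothetical helper: prefix the compiler with ccache only when it's installed.
pg_cc() {
    compiler=${1:-gcc}
    if command -v ccache >/dev/null 2>&1; then
        echo "ccache $compiler"
    else
        echo "$compiler"
    fi
}

# Usage:
#   ./configure --enable-cassert --enable-debug --prefix "$PG_BUILD_DIR" CC="$(pg_cc)"
```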

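If you’re on Linux, you can try the gold linker, which is faster than the default GNU linker (ld.bfd). Unfortunately, it only supports ELF, so it’s not available to Mac OS users.

Just export it in your $CFLAGS before running configure (a CFLAGS argument passed on the configure command line takes precedence over the exported value, so don’t combine the two):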

export CFLAGS="-fuse-ld=gold"
./configure ...
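To keep the debugging flags from earlier and the linker flag in one place, the CFLAGS value can be assembled per platform. This is a sketch that assumes gold is actually installed on the Linux machine; pg_cflags is a hypothetical helper:

```shell
# Hypothetical helper: debug flags everywhere, plus the gold linker on Linux only.
# The OS name may be passed as an argument; it defaults to `uname -s`.
pg_cflags() {
    os=${1:-$(uname -s)}
    flags="-ggdb -Og -g3 -fno-omit-frame-pointer"
    if [ "$os" = "Linux" ]; then
        flags="$flags -fuse-ld=gold"
    fi
    echo "$flags"
}

# Usage:
#   ./configure --enable-cassert --enable-debug --prefix "$PG_BUILD_DIR" CFLAGS="$(pg_cflags)"
```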

Postgres has a slightly unusual tradition of code indentation which seems to have evolved to maximize the number of bytes saved at a time when that mattered, and which continues through to this day. A program similar to Go’s gofmt called pgindent ships with the Postgres source to help automatically reformat source files that are inconsistent.

You may be asked to run pgindent if someone notices that your patch isn’t compliant, and it’s generally a good idea to run it on any source files that you changed before producing a patch anyway.

A few dependencies need to be installed before pgindent can run. The most up-to-date instructions on how to do that can be found in its README (and hint: perltidy has a Homebrew formula).

After that’s done it can simply be run like so on a C file (where our current directory is the Postgres source root):

src/tools/pgindent/pgindent src/backend/utils/adt/mac.c

Given that pgindent is brittle Perl code and appears to have no test coverage, I’d recommend committing changes before using it on any of your code.
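To run it on only the files you touched, one option is to filter the output of git diff down to C sources and headers. The filter below is a sketch; filter_c_sources is a made-up name, and the usage line assumes your branch is based on master and that pgindent accepts several files at once:

```shell
# Hypothetical helper: keep only .c and .h paths from a list of file names on stdin.
filter_c_sources() {
    # grep exits non-zero when nothing matches; treat that as an empty result.
    grep -E '\.(c|h)$' || true
}

# Usage (from the Postgres source root):
#   git diff --name-only master | filter_c_sources | xargs src/tools/pgindent/pgindent
```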

Changes to Postgres are submitted as patch file email attachments to the PG Hackers mailing list. Traditionally, Postgres required that patches be in a particular style called “context format” (as generated by the diff tool’s -c option), but that constraint has since loosened a bit as the “unified diff” (probably what you’re used to seeing from programs like git diff) has become widely considered to be more legible.

One good method for producing a patch that will be acceptable on the mailing list is the use of git format-patch (see footnote 1). This command formats each commit as a separate file named based on the commit message, and includes each entire commit message within the files for extra context. For example:

$ git format-patch master...
0001-Implement-SortSupport-for-macaddr-data-type.patch

Regardless of the tool you use, good commit hygiene is still of paramount importance, so remember to squash and fix using git rebase -i before producing patch files.

If you need to test with a replica, it’s pretty easy to set that up by running a second Postgres instance listening on a different port and tweaking some configuration. Here’s a script that demonstrates how to do that:

#!/bin/sh

set -e

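# Assumes the Postgres binaries live in $PG_DIR/bin. If you installed to a
# separate prefix (like $PG_BUILD_DIR above), adjust the bin paths to match.
export PG_DIR="$PWD"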

export PRIMARY_PORT=5433
export REPLICA_PORT=5434

# read -p is a bash extension, so prompt with printf to stay compatible with /bin/sh
printf '%s' "Will delete $PG_DIR/data/{primary,replica}. Okay? [Ctrl+C cancels] "
read yn
rm -rf $PG_DIR/data/primary
rm -rf $PG_DIR/data/replica

# Initialize a new data directory for the primary, then use a bit of a shortcut
# by just copying it for use by the replica.
$PG_DIR/bin/initdb -D $PG_DIR/data/primary/
cp -r $PG_DIR/data/primary/ $PG_DIR/data/replica/

cat <<EOT >> $PG_DIR/data/primary/postgresql.conf
port=$PRIMARY_PORT
EOT

cat <<EOT >> $PG_DIR/data/replica/postgresql.conf
port=$REPLICA_PORT
shared_buffers=500MB
hot_standby=on
hot_standby_feedback=on
EOT

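# Postgres 12 and later no longer read recovery.conf (the server refuses to
# start if the file exists). Replication settings go in postgresql.conf, and an
# empty standby.signal file puts the instance into standby mode.
cat <<EOT >> $PG_DIR/data/replica/postgresql.conf
primary_conninfo='host=127.0.0.1 port=$PRIMARY_PORT user=$USER'
EOT
touch $PG_DIR/data/replica/standby.signal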

cat <<EOT >> /dev/stdout
READY!
======

Start primary:
    $PG_DIR/bin/postgres -D $PG_DIR/data/primary

Start replica:
    $PG_DIR/bin/postgres -D $PG_DIR/data/replica

Create a database:
    $PG_DIR/bin/createdb -p $PRIMARY_PORT mydb

Connect to primary:
    $PG_DIR/bin/psql -p $PRIMARY_PORT mydb

Connect to replica:
    $PG_DIR/bin/psql -p $REPLICA_PORT mydb
EOT
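Once both instances are running, a quick way to confirm which one is the standby is to ask each whether it's in recovery. The sketch below uses the built-in pg_is_in_recovery() function; the is_replica name and the PSQL override are assumptions made for illustration:

```shell
# Hypothetical helper: succeed if the server on the given port is in recovery
# (i.e. acting as a standby). PSQL may be overridden; it defaults to $PG_DIR/bin/psql.
is_replica() {
    psql_cmd=${PSQL:-$PG_DIR/bin/psql}
    # -A unaligned, -t tuples only, -c run a single command; prints "t" or "f".
    answer=$("$psql_cmd" -p "$1" -Atc 'SELECT pg_is_in_recovery()' postgres) || return 2
    [ "$answer" = "t" ]
}

# Usage:
#   is_replica 5434 && echo "replica is in standby mode"
```

It returns 0 for a standby, 1 for a primary, and 2 if psql couldn't connect.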

1 Note that git format-patch is not officially endorsed and so your mileage with its usage may vary.

Did I make a mistake? Please consider sending a pull request.