Aug 7, 2019

A few months ago, I wrote about how SortSupport works in
Postgres to vastly speed up sorting on
large data types ^{1} like `numeric`

or `text`

, and
`varchar`

. It works by generating **abbreviated keys** for
values that are representative of them for purposes of
sorting, but which fit nicely into the pointer-sized value
(called a “**datum**”) in memory that Postgres uses for
sorting. Most values can be sorted just based on their
abbreviated key, saving trips to the heap and increasing
sorting throughput. Faster sorting leads to speedup on
common operations like `DISTINCT`

, `ORDER BY`

, and ```
CREATE
INDEX
```

.

A patch of mine was recently committed to add
SortSupport for the `inet`

and `cidr`

types, which by my
measurement, a little more than doubles sorting speed on
them. `inet`

and `cidr`

are the types used to store network
addresses or individual hosts and in either IPv4 or IPv6
(they generally look something like `1.2.3.0/24`

or
`1.2.3.4`

).

`inet`

and `cidr`

have some important subtleties in how
they’re sorted which made designing an abbreviated key that
would be faithful to those subtleties but still efficient,
a non-trivial problem. Because their size is limited,
abbreviated keys are allowed to show equality even for
values that aren’t equal (Postgres will fall back to
authoritative comparison to confirm equality or tiebreak),
but they should never falsely indicate inequality.

A property that’s not necessarily obvious to anyone
unfamiliar with them is that network types (`inet`

or
`cidr`

) can either address a single host (what most people
are used to seeing) or an entire subnetwork of arbitrary
size. For example:

`1.2.3.4/32`

specifies a 32-bit netmask on an IPv4 value, which is 32 bits wide, which means that it defines exactly one address:`1.2.3.4`

.`/128`

would work similarly for IPv6.`1.2.3.0/24`

specifies a 24-bit netmask. It identifies the network at`1.2.3.*`

. The last byte may be anywhere in the range of 0 to 255.Similarly,

`1.0.0.0/8`

specifies an 8-bit netmask. It identifies the much larger possible network at`1.*`

.

We’ll establish the following common vocabulary for each
component of an address (and take for example the value
`1.2.3.4/24`

):

- A
**network**, or bits in the netmask (`1.2.3.`

). - A
**netmask size**(`/24`

which is 24 bits). Dictates the number of bits in the network. - A
**subnet**, or bits outside of the netmask (`.4`

). Only`inet`

carries non-zero bits here, and combined with the network, they identify a single**host**(`1.2.3.4`

).

The netmask size is a little more complex than commonly
understood because while it’s most common to see byte-sized
blocks like `/8`

, `/16`

, `/24`

, and `/32`

, it’s allowed to
be any number between 0 and 32. It’s easy to mentally
extract a byte-sized network out of a value (like `1.2.3.`

out of `1.2.3.4/24`

) because you can just stop at the
appropriate byte boundary, but when it’s not a nice byte
multiple you have to think at the binary level. For
example, if I have the value `255.255.255.255/1`

, the
network is just the leftmost bit. 255 in binary is ```
1111
1111
```

, so the network is the bit `1`

and the subnet is 31
consecutive `1`

s.

The difference between `inet`

and `cidr`

is that `inet`

allows a values outside of the netmasked bits. The value
`1.2.3.4/24`

is possible in `inet`

, but illegal in `cidr`

because only zeroes may appear after the network like
`1.2.3.0/24`

. They’re nearly identical, with the latter
being more strict.

In the Postgres source code, `inet`

and `cidr`

are
represented by the same C struct. Here it is in
`inet.h`

:

```
/*
* This is the internal storage format for IP addresses
* (both INET and CIDR datatypes):
*/
typedef struct
{
unsigned char family; /* PGSQL_AF_INET or PGSQL_AF_INET6 */
unsigned char bits; /* number of bits in netmask */
unsigned char ipaddr[16]; /* up to 128 bits of address */
} inet_struct;
```

In Postgres, `inet`

/`cidr`

sort according to these rules:

- IPv4 always appears before IPv6.
- The bits in the network are compared (
`1.2.3.`

). - Netmask size is compared (
`/24`

). - All bits are compared. Having made it here, we know that
the network bits are equal, so we’re in effect just
comparing the subnet (
`.4`

).

These rules combined with the fact that we’re working at
the bit level produces ordering that in cases may not be
intuitive. For example, `192.0.0.0/1`

sorts *before*
`128.0.0.0/2`

despite 192 being the larger number – when
comparing them, we start by looking at the common bits
available in both networks, which comes out to just one bit
(`min(/1, /2)`

). That bit is the same in the networks of
both values (remember, 192 = `1100 0000`

and 128 = ```
1000
0000
```

), so we fall through to comparing netmask size. `/2`

is the larger of the two, so `128.0.0.0/2`

is the larger
value.

Armed with the structure of `inet`

/`cidr`

and how their
sorting works, we can now design an abbreviated key for
them. Remember that abbreviated keys need to fit into the
pointer-sized Postgres datum – either 32 or 64 bits
depending on target architecture. The goal is to pack in as
much sorting-relevant information as possible while staying
true to existing semantics.

We’ll be breaking the available datum into multiple parts, with information that we need for higher precedence sorting rules occupying more significant bits so that it compares first. This allows us to compare any two keys as integers – a very fast operation for CPUs (faster even than comparing memory byte-by-byte), and also a common technique in other abbreviated key implementations like the one for UUIDs.

The first part is easy: all IPv4 values always appear before all IPv6 values. Since there’s only two IP families, so we’ll reserve the most significant bit of our key to represent a value’s family. 0 for IPv4 and 1 for IPv6.

It might seem short-sighted that we’re assuming that only two IP families will ever exist, but luckily abbreviated keys are not persisted to disk (only in the memory of a running Postgres system) and their format is therefore non-binding. If a new IP family were to ever appear, we could allocate another bit to account for it.

The next comparison that needs to be done is against a value’s network bits, so we should include those in the datum.

The less obvious insight is that we can *only* include
network bits in this part. Think back to our example of
`192.0.0.0/1`

and `128.0.0.0/2`

: if we included 192’s full
bits of `1100 0000`

, then when comparing it to 128’s ```
1000
0000
```

, it would sort higher when it needs to come out
lower. In order to guarantee our keys will comply with the
rules, we have to truncate values to just what appears in
the network.

Both `192.0.0.0/1`

and `128.0.0.0/2`

would appear as ```
1000
0000
```

(two of 128’s bits were extracted, but it has a 0 in
the second position) and would appear equal when
considering this part of the abbreviated key. In cases
where that’s all the space in the key we have to work with,
Postgres will have to fall back to authoritative comparison
(which would be able to move on and compare netmask size)
to break the tie.

The network bits are where we need to stop for most of our use cases because that’s all the space in the datum there is. An IPv6 value is 128 bits – after reserving 1 bit in the datum for family, we have 31 bits left on a 32-bit machine and 63 bits on a 64-bit machine, which will be filled entirely with network. An IPv4 value is only 32 bits, but that’s still more space than we have left on a 32-bit machine, so again, we’ll pack in 31 of them.

But there is one case where we have some space left over: IPv4 on a 64-bit machine. Even after storing all 32 possible bits of network, there’s still 31 bits available. Let’s see what we can use them for.

As datums are being compared for IPv4 on a 64-bit machine,
we can be sure that having looked at the 33 bits that
we’ve designed so far – IP family (1 bit) and
network (32 bits) – are equal. That leaves us with 31
bits (64 - 33) left to work with, and lets us move onto
the next comparison rule – netmask size. The largest
possible netmask size for an IPv4 address is 32, which
conveniently fits into only 6 bits (`32 = 10 0000`

) ^{2}.

After adding netmask size to the datum we’re left with 25
bits (31 - 6), which we can use for the next sorting rule
– subnet. Subnets can be as large as 32 bits for a `/0`

value, so we’ll have to shift any that are too large to fit
down to the size available. That will only ever happen for
netmask sizes of `/6`

or smaller – for all commonly seen
netmask sizes like `/8`

, `/16`

, or `/24`

we can fit the
entirety of the subnet into the datum.

With subnet covered, we’ve used up all the available key
bits, but also managed to cover every sorting rule – with
most ^{3} real-world data, Postgres should be able to sort
almost entirely with abbreviated keys without falling back
to authoritative comparison. The final key design looks
like this:

Now that we have an encoding scheme for each different case, we can build an implementation that puts everything into place. This involves the use of many bitwise operations that are common in C, but which many of us who program in high-level languages day-to-day aren’t as used to.

I’ll go through this implementation step-by-step, but you may prefer to refer to the completed version in the Postgres source, which we’ve made an effort to comment comprehensively.

Recall that an IP component is stored as a 16-byte
`unsigned char`

array in the backing network type:

```
typedef struct
{
...
unsigned char ipaddr[16]; /* up to 128 bits of address */
} inet_struct;
```

Our abbreviated keys will be compared as if they were
integers (one of the reasons that they’re so fast), so the
first step is to extract a datum’s worth of bytes from
`ipaddr`

into an intermediate representation that’ll be
used to more easily separate out the final components.
We’ll use `memcpy`

to copy it out byte-by-byte:

```
Datum ipaddr_datum;
memcpy(&ipaddr_datum, ip_addr(authoritative), sizeof(Datum));
```

`ipaddr`

is laid out most significant byte first, which
will be fine when representing an integer on a big-endian
machine, but no good on one that’s little-endian (like most
of our Intel processors), so do a byte-wise position swap
to re-form it (more detail on this talking about `uuid`

’s
abbreviated key implementation:

```
/* Must byteswap on little-endian machines */
ipaddr_datum = DatumBigEndianToNative(ipaddr_datum);
```

And for IPv6, make sure to shift a 1 bit into the leftmost position so that it sorts after all IPv4 values:

```
Datum res;
res = ((Datum) 1) << (SIZEOF_DATUM * BITS_PER_BYTE - 1);
```

Next we’ll extract the leading **network** component using a
technique called bitmasking. This common technique involves
using a bitwise-AND to extract a desired range of bits:

```
1010 1010 1010 1010 (original value)
& 0000 1111 1111 0000 (bitmask)
-------------------
0000 1010 1010 0000 (final result)
```

We’re going to create a bitmask for the **subnet** portion
of the value (reminder: that’s the last part *after* the
network), and it’s size depends on how many subnet bits we
expect to see in `ipaddr_datum`

. For example, if the
network component occupies bits equal or greater to the
datum’s size, then the subnet bitmask will be zero.

The code’s broken into three separate conditionals. This first section handles the case of no bits in the network components. The subnet bitmask should be all ones, which we get by starting with 0, subtracting 1, and allowing the value to roll over to its maximum value:

```
Datum subnet_bitmask,
network;
subnet_size = ip_maxbits(authoritative) - ip_bits(authoritative);
Assert(subnet_size >= 0);
if (ip_bits(authoritative) == 0)
{
/* Fit as many ipaddr bits as possible into subnet */
subnet_bitmask = ((Datum) 0) - 1;
network = 0;
}
```

The next section is the case where there are some bits for both the network and subnet. We use a trick to get the bitmask which involves shifting a 1 left out by the subnet size, then subtracting one to get 1s in all positions that were right of it:

```
0000 0001 0000 0000 (1 << 8)
- 1 (minus one)
-------------------
0000 0000 1111 1111 (8-bit mask)
```

Getting the network’s value then involves ANDing the IP’s
datum and the *negated* form of the subnet bitmask
(`ipaddr_datum & ~subnet_bitmask`

):

```
else if (ip_bits(authoritative) < SIZEOF_DATUM * BITS_PER_BYTE)
{
/* Split ipaddr bits between network and subnet */
subnet_bitmask = (((Datum) 1) << subnet_size) - 1;
network = ipaddr_datum & ~subnet_bitmask;
}
```

The final case represents no bits in the subnet. Set
`network`

to the full value of `ipaddr_datum`

:

```
else
{
/* Fit as many ipaddr bits as possible into network */
subnet_bitmask = 0; /* Unused, but be tidy */
network = ipaddr_datum;
}
```

Recall that IPv4 on a 64-bit architecture is by far the most complex case because we have room to fit a lot more information. This next section involves taking the network and subnet bitmask that we resolved above and shifting it all into place.

The order of operations is:

`network`

: Shift the network left 31 bits to make room for netmask size and 25 bits worth of subnet.`network_size`

: Shift the network size left 25 bits to make room for the subnet.`subnet`

: Extract a subnet using the bitmask calculated above.`subnet`

: If the subnet is longer than 25 bits, shift it down to just occupy 25 bits.`res`

: Get a final result by ORing the values from (1), (2), and (4) above.

```
#if SIZEOF_DATUM == 8
if (ip_family(authoritative) == PGSQL_AF_INET)
{
/*
* IPv4 with 8 byte datums: keep all 32 netmasked bits, netmask size,
* and most significant 25 subnet bits
*/
Datum netmask_size = (Datum) ip_bits(authoritative);
Datum subnet;
/* Shift left 31 bits: 6 bits netmask size + 25 subnet bits */
network <<= (ABBREV_BITS_INET4_NETMASK_SIZE +
ABBREV_BITS_INET4_SUBNET);
/* Shift size to make room for subnet bits at the end */
netmask_size <<= ABBREV_BITS_INET4_SUBNET;
/* Extract subnet bits without shifting them */
subnet = ipaddr_datum & subnet_bitmask;
/*
* If we have more than 25 subnet bits, we can't fit everything. Shift
* subnet down to avoid clobbering bits that are only supposed to be
* used for netmask_size.
*
* Discarding the least significant subnet bits like this is correct
* because abbreviated comparisons that are resolved at the subnet
* level must have had equal subnet sizes in order to get that far.
*/
if (subnet_size > ABBREV_BITS_INET4_SUBNET)
subnet >>= subnet_size - ABBREV_BITS_INET4_SUBNET;
/*
* Assemble the final abbreviated key without clobbering the ipfamily
* bit that must remain a zero.
*/
res |= network | netmask_size | subnet;
}
else
#endif
```

The three other cases (refer to the figure above) are much
simpler because we only have room for network bits. Shift
them right by 1 bit to not clobber our previously set IP
family, then OR with `res`

for the final result:

```
#endif
{
/*
* 4 byte datums, or IPv6 with 8 byte datums: Use as many of the
* netmasked bits as will fit in final abbreviated key. Avoid
* clobbering the ipfamily bit that was set earlier.
*/
res |= network >> 1;
}
```

The abbreviated key implementation here is complex enough that in most contexts I’d probably consider it a poor trade off – added speed is nice to have, but there is a cost in the ongoing maintenance burden of the new code and its understandability by future contributors.

However, Postgres is a highly leveraged piece of software.
This patch makes sorting and creating indexes on network
types *~twice as fast*, and that improvement will trickle
down automatically to hundreds of thousands of Postgres
installations around the world as they’re upgraded to the
next major version. If there’s one place where trading some
more complexity for speed is worth it, it’s cases like this
one where only very few have to understand the code, but
very many will reap its benefits. We’ve also made sure to
add extensive comments and test cases to keep future code
changes as easy as they can be.

Thanks to Peter Geoghegan for seeding the idea for this patch, as well as for advice and very thorough testing/review, and Edmund Horner for review.

^{1} Technically, pass-by-reference types. Generally those
that can’t fit their entire value in a datum.

^{2} I originally thought that by subtracting one from 32 I
could fit netmask size into only 5 bits (31 = ```
1
1111
```

), but that’s not possible because 0-bit netmasks
are allowed and we therefore need to be able to
represent the entire range of 0 to 32. For example,
`1.2.3.4/0`

is a legal value in Postgres.

^{3} Authoritative comparison will still be needed in the
case of equal network values and values with short
networks (`/6`

or less) that share many leading bits.

Did I make a mistake? Please consider sending a pull request.