Fast as a Service

La Cheminée (a tall multi-colored art project) dans La Défense

The middle days of Heroku were distinguished by two distinct ideological groups, cooperating to build and lay out the future for the platform, but also somewhat at odds with each other. Let’s call them “the Stables” and “the Philosophers”.

As the name would suggest, the Stables were the more conservative of the two, with existing services to run, existing users to think about, and large, haphazard codebases constraining the possibilities for fast and easy progress. They were pragmatic day-to-day engineers who cared about the product, but who were also incentivized to avoid overly optimistic changes that’d lead to more middle-of-the-night and weekend pages sent their way.

The Philosophers, inspired by the folk history of early innovation groups like Bell Labs and Xerox PARC, specialized in academic thinking in isolation, often from the lofty heights of an ivory tower. With limited ability to make real changes to existing products themselves, they relied on their ideas being so impactful that a trickle-down effect would inspire the Stables to take them up and transform them into reality.

Heroku’s long-term failure to meet the ambitious goals it’d set for itself could be at least partially attributed to the failure of both groups. As technical debt and operational burden mounted over the years, the Stables became quagmired, unable to make the ambitious macro-level product changes that would be needed for the company to make a real market impact following its initial innovations like git push heroku master and buildpacks. The Philosophers had some good ideas that might’ve helped, but no ability to effect real change. The trickle-down effect for ideas didn’t work, and none of their prototypes made it to realized product form. Heroku’s platform survived, but never thrived, with little in the way of groundbreaking advances since the introduction of the Cedar stack.

One of those early Philosopher prototypes was a toy program to demonstrate the concept of “fast as a service”. Heroku’s API at the time was not fast. Written in Ruby and leveraging high-magic modules like ActiveRecord, it had all the usual issues like N+1s resulting from hard-to-spot lazy loading problems, and committed many other performance sins like in-band remote service calls, largely the result of an accelerated development schedule during a time when service engineering as a field was still emergent and best practices weren’t well established. API latency manifested in the real world by making the Heroku CLI less responsive. Commands like heroku list took long enough that there was a noticeable delay before results were delivered.

The fast as a service prototype demonstrated a single API endpoint rewritten in Go and was built around the premise that improved API responsiveness was a feature that’d flow into increased user satisfaction. A popular conference slide making the rounds at the time made the case that what would seem like minor differences in responsiveness had major impact on the subjective user experience:

By making every API response fast, it’d make the Heroku CLI feel more like a local program, unlocking new modes of advanced operation, and leading to new heights of productivity. Long chains of commands (e.g. heroku list, heroku info -a api, heroku ps -a api, heroku ps:scale web=2 -a api) would be as fluid as traversing your local file system (ls, stat, ps, cat, …).

The Go API prototype showed an impressive ~10x speed improvement over the Ruby implementation, but it was never a realistic path forward. Implementing only a single API endpoint, and only the easy parts, it ignored the difficult, inconvenient details that’d be required to make a practical alternative that was production ready and backwards compatible. A Go rewrite might’ve been possible, but it’d take on the order of 100x more effort and tenacity to wrangle into reality, more than any Philosopher would be able or willing to apply.

Still, fast as a service was a good idea. Responsive web services are objectively a good thing.

While changing existing products with a large base of users is hard, starting greenfield, or from a project of smaller size, is easier. Fast as a service is a concept that I’ve kept in mind while building Bridge, and we’ve managed to mostly achieve it by focusing on it as a goal from the beginning. We’re not doing anything extraordinarily novel, but we’ve made sure that we’re building on solid foundations:

Well, at least I think our API is fast, but it’d be nice to quantify that.

Something I’ve been struggling with over the past couple of years is finding a good way to perform ad-hoc queries to get insight into how our stack is running. At my last couple of jobs I had access to Splunk, which is one of the better tools on the market, but whose pricing model is so outrageous that it’s hard to justify introducing it (you’re given the privilege of paying upwards of millions of dollars a year and get to run it all yourself). We have an alternative log provider that our logs drain into, but it can’t perform aggregates of any kind, making it a glorified log viewer that’s borderline useless for operational work.

So a few days ago I did something that I never thought I’d do again, and started putting some of our more critical logs in Postgres.

Never, ever do this, unless:

And even then, it’s a technique that’s probably going to have trouble scaling to enormous sizes.

We have a middleware that generates per-request canonical digests and hands them off to a waiting goroutine which upserts them in batches for every 100 that accumulate (or every 15 seconds, whichever occurs first). A well-defined schema keeps us honest, and provides tab completion when querying from Postgres:

CREATE TABLE canonical_api_line (
    id                               uuid NOT NULL DEFAULT gen_ulid(),
    account_email                    varchar(200),
    account_id                       uuid,
    api_error_code                   varchar(200),
    api_error_internal_code          varchar(200),
    api_error_cause                  varchar(2000),
    api_error_message                varchar(2000),
    auth_internal_name               varchar(200),
    auth_type                        varchar(200),
    content_type                     varchar(200),
    created_at                       timestamptz NOT NULL DEFAULT current_timestamp,
    duration                         interval NOT NULL,
    http_method                      varchar(20) NOT NULL,
    http_path                        varchar(200) NOT NULL,
    http_path_original               varchar(200),
    http_route                       varchar(200), -- may be NULL in case of no route match
    idempotency_key                  uuid,
    idempotency_replay               boolean NOT NULL DEFAULT false,
    ip                               inet NOT NULL,
    query_string                     varchar(2000),
    request_id                       uuid NOT NULL,
    request_payload                  jsonb,
    request_payload_capture_disabled boolean NOT NULL DEFAULT false,
    sentry_event_id                  varchar(200),
    sso_provider                     varchar(200),
    statistics                       jsonb,
    status                           int NOT NULL,
    sudo                             boolean NOT NULL,
    sudo_account_email               varchar(200),
    sudo_account_id                  uuid,
    updated_at                       timestamptz NOT NULL DEFAULT current_timestamp,
    user_agent                       varchar(200),
    x_crunchy_headers                varchar(200)[],
    PRIMARY KEY (created_at, id)
) PARTITION BY RANGE (created_at);

So with Postgres-based canonical lines in place, we can start to produce some numbers. Pure retrievals tend to be faster than mutations, so let’s look at GET actions first, taken at 50th, 95th, and 99th percentiles:

SELECT
    http_method,
    http_route,
    count(*) AS count,
    (percentile_cont(0.50) WITHIN GROUP (ORDER BY duration)) AS p50,
    (percentile_cont(0.95) WITHIN GROUP (ORDER BY duration)) AS p95,
    (percentile_cont(0.99) WITHIN GROUP (ORDER BY duration)) AS p99
FROM canonical_api_line
WHERE http_method = 'GET'
    AND created_at > now() - '24 hours'::interval
GROUP BY http_method, http_route
ORDER BY count DESC
LIMIT 10;
 http_method |                http_route                | count |       p50       |       p95       |       p99
 GET         | /metric-views/{name}                     | 60567 | 00:00:00.012559 | 00:00:00.020186 | 00:00:00.04098
 GET         | /clusters                                | 13616 | 00:00:00.010909 | 00:00:00.020494 | 00:00:00.034664
 GET         | /clusters/{cluster_id}/upgrade           | 12127 | 00:00:00.021422 | 00:00:00.039452 | 00:00:00.060423
 GET         | /clusters/{cluster_id}/roles/{role_name} | 11717 | 00:00:00.020676 | 00:00:00.038723 | 00:00:00.056532
 GET         | /teams                                   | 11306 | 00:00:00.009896 | 00:00:00.023088 | 00:00:00.040053
 GET         | /account                                 |  9863 | 00:00:00.007844 | 00:00:00.020712 | 00:00:00.034109
 GET         | /clusters/{cluster_id}                   |  9737 | 00:00:00.055412 | 00:00:00.14692  | 00:00:00.240062
 GET         | /health-checks/{name}                    |  5748 | 00:00:00.002259 | 00:00:00.00374  | 00:00:00.006875
 GET         | /clusters/{cluster_id}/status            |   743 | 00:00:00.052975 | 00:00:00.950019 | 00:00:02.878263
 GET         | /clusters/{cluster_id}/databases         |   685 | 00:00:00.602332 | 00:00:11.056697 | 00:00:21.827728

Everything is under 100 ms (and usually well under), with the main outlier being the /clusters/{cluster_id}/databases endpoint, which is slower because it’s SSHing out to a Postgres server to see what Postgres databases it has. /clusters/{cluster_id}/status also has some bad tail latency because it’s remoting down a layer to our backend service.

Now bucket up all the mutating verbs of DELETE, PATCH, POST, and PUT:

SELECT
    http_method,
    http_route,
    count(*) AS count,
    (percentile_cont(0.50) WITHIN GROUP (ORDER BY duration)) AS p50,
    (percentile_cont(0.95) WITHIN GROUP (ORDER BY duration)) AS p95,
    (percentile_cont(0.99) WITHIN GROUP (ORDER BY duration)) AS p99
FROM canonical_api_line
WHERE http_method IN ('DELETE', 'PATCH', 'POST', 'PUT')
    AND created_at > now() - '24 hours'::interval
GROUP BY http_method, http_route
ORDER BY count DESC
LIMIT 10;
 http_method |              http_route              | count  |       p50       |       p95       |       p99
 POST        | /metrics                             | 434418 | 00:00:00.00966  | 00:00:00.014127 | 00:00:00.021965
 POST        | /vendor/owl/metrics                  | 263276 | 00:00:00.006226 | 00:00:00.008968 | 00:00:00.014077
 POST        | /vendor/owl/pellet                   |   6355 | 00:00:00.022203 | 00:00:00.039547 | 00:00:00.078622
 POST        | /queries                             |    460 | 00:00:00.414387 | 00:00:13.013604 | 00:00:25.092792
 POST        | /access-tokens                       |    291 | 00:00:00.007704 | 00:00:00.012232 | 00:00:00.017056
 POST        | /sessions                            |     35 | 00:00:00.03221  | 00:00:00.210989 | 00:00:00.430671
 POST        | /clusters/{cluster_id}/roles         |     29 | 00:00:00.0265   | 00:00:00.07772  | 00:00:00.124021
 POST        | /clusters/{cluster_id}/backup-tokens |     28 | 00:00:00.062844 | 00:00:00.300118 | 00:00:00.470098
 POST        | /vendor/heroku/login                 |     17 | 00:00:00.014799 | 00:00:00.022739 | 00:00:00.025979
 DELETE      | /clusters/{cluster_id}               |     14 | 00:00:00.045736 | 00:00:00.091442 | 00:00:00.140435

Like /clusters/{cluster_id}/databases above, POST /queries has to SSH to individual Postgres clusters, so it’s going to take longer and have some bad tail latency. POST /clusters and POST /clusters/{cluster_id}/upgrade, which create and upgrade clusters respectively, both reach down a layer and do a fair bit of heavy lifting. Still, they’d be good candidates for a closer look to see whether anything they do could be moved out-of-band to a background worker.

Overall though, mostly good, and we’re keeping responsiveness on most endpoints close to or under that 100 ms target. This is of course duration measured from inside the stack, and the numbers for clients making remote calls to us won’t be as good. There are always areas for improvement, but relatively speaking, I’d say that yes, we are fast as a service.

La Défense

The photo at the top is La Cheminée by Raymond Moretti, located in Paris’ La Défense district. Cities around the world have all trended towards the same car-centric sprawls of noise and pavement, so it’s rare that I’m enamored by one, but La Défense in Paris is genuinely unique. You emerge from the closest metro station onto a pedestrian promenade kilometers in length, surrounded on all sides by art projects like this one and modern glass buildings, and bookended by La Grande Arche, a massive cube that’s as distinctive to Paris’ skyline as the Eiffel Tower itself.

More neighborhoods like this please!

La Défense

La Défense (again)