brandur.org

One more note on password migrations: damn it, we did find one bug. Luckily, an internal user ran into it early and it was never experienced publicly.

We run our Postgres service on Postgres, and develop like we prescribe others to, with a core tenet being data consistency. I’d added these CHECK constraints:

ALTER TABLE account
    ADD CONSTRAINT pbkdf2_not_null_check CHECK (
        (algorithm <> 'argon2id-pbkdf2-sha256' AND algorithm <> 'pbkdf2-sha256')
        OR ((algorithm = 'argon2id-pbkdf2-sha256' OR algorithm = 'pbkdf2-sha256')
            AND (pbkdf2_hash_iterations IS NOT NULL))
    );
ALTER TABLE account
    ADD CONSTRAINT pbkdf2_null_check CHECK (
        (algorithm = 'argon2id-pbkdf2-sha256' OR algorithm = 'pbkdf2-sha256')
        OR ((algorithm <> 'argon2id-pbkdf2-sha256' AND algorithm <> 'pbkdf2-sha256')
            AND (pbkdf2_hash_iterations IS NULL))
    );

They’re a little awkward to read, but the first guarantees we have a value for PBKDF2 hash iterations where the password involves PBKDF2. The second guarantees we don’t value a value where it doesn’t.

There was a case in the password reset flow where we’d upgrade a newly reset password to Argon2id, but if the old password was PBKDF2, forgot to empty its PBKDF2 hash iterations. The CHECK constraint failed along with the password reset.

We got a fix out quickly, but there’s an argument to be made that it wasn’t worth breaking a critical path (even temporarily) for data consistency.

The cost/benefit of this case is close enough that I might agree, but would still form a counterargument. A 500 is bad, but when this sort of thing would happen at Stripe (on Mongo, no CHECKs), instead of some 500s you’d end up with a pile of inconsistent data written that someone would have to manually repair afterwards. A small error that was in production for only a minute or two would lead to a multi-week cleanup effort, with many a bespoke migration and outbound email written.

When CHECKs (or FKs, data type validations, etc.) fail, no tainted state makes its way to the database. There’s user impact until the problem is fixed, but no impact afterwards. Also, even if a user request succeeds in the non-CHECK alternative with inconsistent data written, it doesn’t mean that things are actually right, and often leads users having to find out for themselves that something’s broken thanks to the inconsistencies.

We’ll probably keep doing the CHECKs.