Engineering

The Postgres migrations we run without downtime.

By Mohammed Rafi05 Nov 20259 min read
Almost every production database outage we've seen — ours, peers', war stories from friends — has been a migration that should have been routine. The schema change itself was correct. The deploy ran. Something locked, the application timed out, and somewhere between five and forty-five minutes passed before the team realized what was happening. By then the on-call channel was already on fire. We don't run any schema change against a production Postgres database that hasn't passed the following checklist. It's short. It's specific. It's not optional. 1. No ALTER TABLE that takes an ACCESS EXCLUSIVE lock for more than a second. Adding a column with a default value in Postgres versions before 11 rewrote the entire table. In 11+ it's mostly fine, but only if the default is constant. We always add the column as nullable first, backfill in chunks, and add the NOT NULL constraint as a second migration. The constraint addition still takes a lock, but it's brief because the data is already there. 2. No CREATE INDEX without CONCURRENTLY. A non-concurrent index build holds a write lock on the table for the duration of the build. On a 50-million-row table, that can mean tens of minutes of blocked writes. CREATE INDEX CONCURRENTLY takes longer in wall-clock time and uses more I/O, but it doesn't block writes. The only correct choice in production is CONCURRENTLY. The only correct choice in development is also CONCURRENTLY, because we want the same migration script to run in both environments. 3. Backfills run in chunks, not in one statement. A backfill that updates a million rows in a single UPDATE statement is a hold on every row for the duration of the transaction. Every query that touches those rows waits. The right shape is: a loop, a chunk size of one to ten thousand rows, a small sleep between batches, and a way to resume from the last successful chunk. We use a script — not a migration — for backfills. The migration changes the schema. The script fills the data. 4. Every migration is reversible. Not because we expect to roll back — we almost never do. But writing the down migration forces us to think about what the up migration actually did. A migration with no down migration is a migration we haven't fully understood. The down doesn't have to restore data; it has to restore the schema to a state the previous application version can run against. 5. Migrations don't ship with code that depends on them. This is the rule that surprises most teams. We deploy the migration first, in a release that doesn't yet use the new schema. Then we deploy the code that reads or writes the new column. Then, eventually, a third release removes the code that wrote to the old column. This is annoying. It is three deploys to do what feels like one change. It is also why we don't have outages. 6. The query plan is checked, not assumed. Before any migration that adds an index, we run EXPLAIN ANALYZE against the production-shape query — usually against a recent snapshot in staging — both with and without the new index. If the planner picks the new index, great. If it doesn't, we either learn why or we don't ship the index. An index that the planner ignores is a write-side cost with no read-side benefit. 7. Locks are observed, not hoped for. We have a small script that reads pg_stat_activity and pg_locks during every migration, and surfaces long-held locks in a status line. If a migration is holding an ACCESS EXCLUSIVE lock for more than a second, we cancel it. We'd rather rerun the migration with a different shape than discover, three minutes in, that the application is timing out. 8. There is a written rollback for every release that includes a migration. The rollback is short. It says: if X happens, run command Y. If Y doesn't work in thirty seconds, run command Z. The rollback is part of the PR. The on-call engineer reads it before the deploy goes out. It is the cheapest insurance we know how to write. The combined result of all this is that our migrations are boring. They run. The application keeps responding. The on-call channel stays quiet. The reason is not that we're better at Postgres than anyone else — it's that we've made the same checklist twice as long as feels comfortable, and we refuse to skip any of it.
PostgresMigrationsReliabilityDatabase

Have something ambitious in mind?

We reply to every email within 48 hours. Call or async, whichever you prefer.