The other day, we were running a database migration on our primary Postgres database to add a nullable column with no default to a large table. Common wisdom says that adding a nullable column with no default, even on a high volume table, is a safe operation, and doesn’t require any downtime.
Twenty minutes after starting the migration, however, it was still running, and we started seeing some requests timing out and others taking much longer than normal.
After investigating the other queries running at the time, we discovered that we were running a backfill of some old data. We stopped the backfill, the migration finished almost immediately, and the site began operating normally again.
Our suspicion was that the migration and some other query were contending for some database-level lock. Upon further examination, the backfill script was reading from the table that we were adding the column to, but not writing to it.
It turns out that even plain reads from a table lock it, and if those reads take place in the context of a transaction, the lock is held until the transaction ends, blocking ALTER TABLE commands for the duration. In most cases, when transactions finish quickly, this is never an issue. However, when a transaction runs for a while, as in the case of our backfill, it can cause serious production issues.
Postgres Explicit Locking
Postgres has a system of explicit locks that are used to manage control of data. You can take a look at what locks are currently held by querying the pg_locks table:
fin_dev=# select relation::regclass, locktype, mode, granted from pg_locks where relation::regclass::text != 'pg_locks';
The relation column indicates which table is being locked, mode indicates the lock mode (more on what this means below), and granted indicates whether the lock has actually been acquired: t if it is held, f if the session is still waiting for it.
Let’s say that we have a table called messages. Then, let’s start a transaction and perform a query:
begin;
select * from messages where id = 4;
This creates a lock in Postgres called an ACCESS SHARE lock. This is a lock that applies to the whole table. As the Postgres docs explain, this lock is not an exclusive lock — other sessions and transactions can take ACCESS SHARE locks concurrently.
Looking at the locks table:
 relation | locktype |      mode       | granted
----------+----------+-----------------+---------
 messages | relation | AccessShareLock | t
The only lock that ACCESS SHARE prevents another session from taking is the ACCESS EXCLUSIVE lock, a lock which, as the name would suggest, conflicts with every other lock mode: it cannot be acquired on a table while any other lock is held on it, and once acquired it prevents any other locks from being taken.
ACCESS EXCLUSIVE is requested by alter table when we attempt to add a column to the table.
alter table messages add column recipient_id integer null;
And again, looking at the locks table:
 relation | locktype |        mode         | granted
----------+----------+---------------------+---------
 messages | relation | AccessShareLock     | t
 messages | relation | AccessExclusiveLock | f
Notice that the ACCESS EXCLUSIVE lock has not been granted, because the ACCESS SHARE lock has not been given up yet.
The alter table command will simply hang until the ACCESS SHARE lock is released.
Before the alter table command was issued, other transactions could still query the messages table and acquire their own ACCESS SHARE locks concurrently. Now, however, subsequent attempts to take an ACCESS SHARE lock will queue behind the ACCESS EXCLUSIVE lock (which is itself blocked by the initial ACCESS SHARE lock):
begin; select * from messages where id = 3;
Now, the locks table features a new lock that has not been granted:
 relation | locktype |        mode         | granted
----------+----------+---------------------+---------
 messages | relation | AccessShareLock     | t
 messages | relation | AccessExclusiveLock | f
 messages | relation | AccessShareLock     | f
While it would be perfectly safe to grant the new query its lock, Postgres grants conflicting locks in request order, so the new query must wait behind the queued ACCESS EXCLUSIVE lock before it can take its own.
This is why many of our requests started to fail after the migration started.
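With hindsight, the blocking chain would have been visible directly from the database. A sketch of a diagnostic query, assuming Postgres 9.6 or later (where the pg_blocking_pids function is available):

```sql
-- List sessions that are waiting on locks, along with the pids of the
-- sessions blocking them.
select pid,
       pg_blocking_pids(pid) as blocked_by,
       state,
       query
from pg_stat_activity
where cardinality(pg_blocking_pids(pid)) > 0;
```

In our incident, a query like this would have shown the alter table waiting on the backfill's backend, and every new read of the messages table queued behind the alter table.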
Takeaways and Prevention Strategies
The main takeaway from this for us was to be more cautious with long-running transactions. These transactions can cause serious problems, and when running backfills or any long job, it's crucial to be conscious of how long any particular transaction will run, and to do a better job of chunking the work into separate transactions where possible.
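As a minimal sketch of what chunking might look like, consider a hypothetical backfill that populates recipient_id (the backfilled value of 0 and batch size of 1000 are placeholders). Run under autocommit, each statement is its own short transaction, so no ACCESS SHARE lock is held for long:

```sql
-- Hypothetical batched backfill: run repeatedly until it updates zero rows.
-- Each execution is a separate short transaction in autocommit mode.
update messages
set recipient_id = 0
where id in (
    select id
    from messages
    where recipient_id is null
    order by id
    limit 1000
);
```

Between batches, any pending alter table gets a window in which to acquire its ACCESS EXCLUSIVE lock.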
For workflows that do demand longer transactions, understand that there aren’t simple solutions to running migrations safely. The best you can do in this scenario is to be careful when you run a migration to make sure it doesn’t coincide with a long transaction.
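One partial safeguard, rather than a full solution, is Postgres's lock_timeout setting: it makes the migration fail fast instead of queuing indefinitely behind a long transaction, so it can be retried later rather than stalling every new reader behind it. A sketch:

```sql
-- Abort the alter table if its lock cannot be acquired within 2 seconds,
-- instead of queuing (and making every new query queue behind it).
set lock_timeout = '2s';
alter table messages add column recipient_id integer null;
```

This doesn't make the migration succeed while the long transaction is running; it just limits the blast radius of the attempt.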
Hope this was helpful to anybody else running into similar problems, and feel free to reach out if you have any questions!
— Sid Shankar