Running Database Migrations in Production Without Losing Sleep
The deployment pipeline is green. The code changes have been reviewed and approved. The staging environment looks fine. Then comes the moment every engineer dreads: running the migration against the production database.
This is where the real risk lives. A schema change that works perfectly on your laptop can lock a production table for minutes. A data migration that runs fine with a thousand rows can take hours against a million. And when things go wrong, the application starts returning errors to real users.
The problem is not just technical. It is also about timing, coordination, and knowing when to stop.
When to Run the Migration
Unlike application deployments, which you can often do at any hour, database migrations directly affect query performance and data availability. The safest time to run them is when traffic is low. Many teams call this a downtime window, but that name is misleading. A downtime window does not mean the application must go offline. It means you have agreed on a period when you can make changes that might cause disruption, and you have prepared for that possibility.
The real question is: can your migration run while the application is still serving traffic? If you design your migrations carefully, the answer is often yes. But that depends on understanding what your database does when you run certain commands.
The Locking Problem
The most common source of trouble during production migrations is locking. When you run a command like ALTER TABLE, most databases lock the table to prevent other changes while the operation is in progress. If that command takes a long time, every query from your application that touches that table either waits or fails. Users start seeing slow pages or error messages.
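In PostgreSQL, one common mitigation is a lock_timeout: if the DDL statement cannot acquire its lock quickly, it fails fast instead of queuing behind running queries (and forcing every later query to queue behind it). A minimal sketch, assuming a DB-API style connection; the helper name and timeout are illustrative:

```python
# Minimal sketch: run DDL under PostgreSQL's lock_timeout setting so a
# blocked ALTER TABLE errors out quickly instead of queuing behind
# traffic. `conn` is any DB-API connection (an assumption).
def run_ddl_with_timeout(conn, ddl, timeout_ms=5000):
    with conn.cursor() as cur:
        cur.execute("SET lock_timeout = '%dms'" % timeout_ms)
        cur.execute(ddl)  # raises if the lock is not acquired in time
    conn.commit()
```

If the statement times out, nothing has changed and you can simply retry at a quieter moment.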
Some databases offer ways to reduce locking. PostgreSQL supports CREATE INDEX CONCURRENTLY, which builds an index without blocking writes. MySQL's online DDL covers many operations (ALGORITHM=INPLACE, and ALGORITHM=INSTANT for some changes in 8.0). But not every change can be done without a lock. Changing a column type or removing a column often requires an exclusive lock, and adding a column with a default value rewrote the entire table in PostgreSQL before version 11 (constant defaults have been metadata-only changes since).
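CREATE INDEX CONCURRENTLY has one operational quirk worth encoding in your tooling: it refuses to run inside a transaction block. A sketch, assuming a psycopg2-style connection with an autocommit attribute; the helper name is hypothetical:

```python
# Sketch: PostgreSQL's CREATE INDEX CONCURRENTLY cannot run inside a
# transaction block, so autocommit is enabled first. The connection's
# `autocommit` attribute is psycopg2-style (an assumption).
def create_index_concurrently(conn, table, column):
    conn.autocommit = True  # CONCURRENTLY refuses to run in a transaction
    with conn.cursor() as cur:
        cur.execute(
            "CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_%s_%s ON %s (%s)"
            % (table, column, table, column)
        )
```

Note that many migration frameworks wrap each migration in a transaction by default, which is exactly why CONCURRENTLY often fails there until you opt out.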
The key is to know what your database supports before you write the migration. Check the documentation for your database version. Test the migration on a copy of your production data, not just a small sample. Measure how long the lock would last under realistic conditions.
For example, here is a safe pattern for adding a column with a default value in PostgreSQL, avoiding a long exclusive lock:
-- Step 1: Add the column as nullable (fast, no default rewrite)
ALTER TABLE users ADD COLUMN display_name text;
-- Step 2: Backfill the column in small batches
UPDATE users SET display_name = username  -- UPDATE has no LIMIT in PostgreSQL,
WHERE id IN (SELECT id FROM users         -- so batch via a subquery
             WHERE display_name IS NULL LIMIT 1000);
-- Repeat until no rows remain
-- Step 3: Enforce NOT NULL. A bare SET NOT NULL scans the whole table
-- under an exclusive lock; validate a CHECK constraint first instead
-- (PostgreSQL 12+ reuses the validated constraint and skips the scan).
ALTER TABLE users ADD CONSTRAINT users_display_name_not_null
  CHECK (display_name IS NOT NULL) NOT VALID;
ALTER TABLE users VALIDATE CONSTRAINT users_display_name_not_null;
ALTER TABLE users ALTER COLUMN display_name SET NOT NULL;
ALTER TABLE users DROP CONSTRAINT users_display_name_not_null;
Break Large Migrations Into Small Steps
A common mistake is trying to do too much in one migration: adding a new column, backfilling it with data from another column, and dropping the old column, all in one step. This creates a long-running operation that holds locks for the entire duration.
A safer approach is to split the work into multiple smaller migrations:
- Add the new column without a default value.
- Run a background job to populate the new column in batches.
- Verify the data is complete and correct.
- Drop the old column in a separate migration.
Each step can be verified before moving to the next. If something goes wrong at step two, you have not lost anything. You can fix the data and retry. This approach takes longer, but it is far less risky.
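The background job in step two can be a plain loop that updates a bounded batch, commits, and repeats until nothing is left. A sketch, reusing the users/display_name example from earlier and assuming a DB-API connection; the batch size and pause are illustrative, not recommendations:

```python
import time

# Batch via a subquery, since UPDATE has no LIMIT clause in PostgreSQL.
BATCH_SQL = """
UPDATE users SET display_name = username
WHERE id IN (SELECT id FROM users
             WHERE display_name IS NULL LIMIT %d)
"""

def backfill(conn, batch_size=1000, pause=0.1):
    """Run small UPDATE batches until none remain; return rows updated."""
    total = 0
    while True:
        with conn.cursor() as cur:
            cur.execute(BATCH_SQL % batch_size)
            updated = cur.rowcount
        conn.commit()          # commit per batch so row locks are released
        total += updated
        if updated < batch_size:
            return total       # last (partial or empty) batch: done
        time.sleep(pause)      # give concurrent traffic room to run
```

Committing after every batch is the point: locks are held only for one small batch at a time, and a failure partway through loses nothing, because the WHERE clause makes the job safe to rerun.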
Build Safety Checks Into the Pipeline
Your pipeline should not blindly run migrations. It needs to check the current state of the database before making changes. Before the migration starts, the pipeline can check for long-running queries. If there are queries that have been running for more than a few seconds, the migration should wait. Those queries might be holding locks that would block the migration, or the migration might block them.
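That pre-flight check can be a single query against PostgreSQL's pg_stat_activity view. A sketch: the view and its columns are real PostgreSQL catalog features, but the threshold and the go/no-go policy are assumptions to tune for your system:

```python
# Sketch: pre-migration gate. Counts statements that have been active
# longer than a threshold; pg_stat_activity is PostgreSQL's view of
# running backends. The 5-second default is an assumption.
LONG_QUERY_SQL = (
    "SELECT count(*) FROM pg_stat_activity "
    "WHERE state = 'active' "
    "AND now() - query_start > interval '%d seconds' "
    "AND pid <> pg_backend_pid()"
)

def safe_to_migrate(conn, threshold_seconds=5):
    with conn.cursor() as cur:
        cur.execute(LONG_QUERY_SQL % threshold_seconds)
        (long_running,) = cur.fetchone()
    return long_running == 0   # proceed only if nothing is long-running
```

The pipeline can poll this in a loop and start the migration only once it returns true, or give up after a deadline and reschedule.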
During the migration, the pipeline should monitor:
- Execution duration: how long each statement takes
- Lock wait time: whether other queries are waiting on locks
- Error rate: any errors from the database or application
If any of these metrics crosses a threshold, the pipeline should stop the migration and notify the team. This is not excessive caution; it is a clear mechanism that keeps a small problem from becoming a production incident.
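The abort decision itself can be boring, explicit code. A sketch; the metric names and threshold values below are illustrative assumptions, not recommendations:

```python
# Illustrative thresholds (assumptions; tune per system):
THRESHOLDS = {
    "statement_seconds": 30,   # longest a single statement may run
    "lock_wait_seconds": 5,    # longest others may wait on our locks
    "error_rate": 0.01,        # fraction of failing requests tolerated
}

def breached(metrics, thresholds=THRESHOLDS):
    """Return the metrics that crossed their threshold (abort if non-empty)."""
    return [name for name, limit in thresholds.items()
            if metrics.get(name, 0) > limit]
```

Logging *which* threshold triggered the abort matters as much as aborting: it is the first thing the team will ask when the notification arrives.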
What Happens After the Migration Completes
The migration finished without errors. The pipeline shows green. But the work is not done yet.
Your application might still have old database connections open. Connection pools cache connections, and drivers or ORMs may cache prepared statements and query plans that were built against the old schema. These issues do not always show up immediately. They can cause subtle errors minutes or hours later.
Many teams restart the application after a migration, or at least drain old connections and let new ones be created. Others wait for a few minutes while monitoring logs and metrics before declaring the migration successful. The exact approach depends on your application stack and how your connection pooling works.
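What "drain old connections" means depends on the pool. As a toy illustration (the Pool class here is hypothetical, not a real library API): close every idle connection, so that future checkouts build fresh connections against the new schema.

```python
# Toy connection pool (hypothetical, for illustration) showing what
# "drain and recreate" means: idle connections are closed, so the next
# checkout opens a fresh connection against the new schema.
class Pool:
    def __init__(self, factory):
        self._factory = factory   # callable that opens a new connection
        self._idle = []           # connections available for reuse

    def checkout(self):
        return self._idle.pop() if self._idle else self._factory()

    def checkin(self, conn):
        self._idle.append(conn)

    def drain(self):
        """Close all idle connections; later checkouts start fresh."""
        while self._idle:
            self._idle.pop().close()
```

Real pools expose equivalents of this hook; check your pool's documentation for how to recycle or invalidate existing connections.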
Practical Checklist for Production Migrations
Before you run a migration in production, go through this checklist:
- Test the migration on a copy of production data, not just a development database
- Measure how long each statement takes and what locks it acquires
- Verify the database version supports lock-free alternatives for your operation
- Check for long-running queries before starting the migration
- Set up monitoring for lock wait time and error rate during the migration
- Define a clear abort condition: what metric or error triggers a stop
- Prepare a rollback plan: how to revert the migration if something goes wrong
- Notify the team before and after the migration runs
- Plan to restart the application or refresh connections after the migration
The Takeaway
Running a database migration in production is not about avoiding risk. It is about understanding the risk, breaking it into manageable pieces, and having clear signals for when to stop. The best migrations are the ones that nobody notices because they ran smoothly. The second best are the ones that were stopped early before they caused real damage. The worst are the ones that ran to completion but broke the application in ways that took hours to detect. Build your pipeline to aim for the first category, but always be ready for the second.