Testing Database Migrations Before They Hit Production
You have written a migration script. It looks correct. The syntax is valid. The logic seems right. You run it against your local database, and it works fine. Then you deploy it to production, and everything breaks.
The table had millions of rows. Your local database had twelve. The migration marked a column NOT NULL, but production rows held NULL values you did not know about. The ALTER TABLE locked writes for fifteen minutes during peak traffic.
This scenario is not rare. It happens because the environment where you test your migration does not match the environment where it will actually run. A migration that passes on an empty or mismatched database gives false confidence. The fix is not to test more carefully on your local machine. The fix is to build a test environment that reflects production reality.
Why a Matching Schema Matters
A migration script is written against assumptions. It assumes certain columns exist, certain indexes are present, certain constraints are in place. If your test database has a different schema, you are testing against a different set of assumptions.
Consider a migration that makes an existing column NOT NULL. On a test database with an empty table, the migration runs instantly. On production, the table has rows with NULL values in that column. The migration fails, and now you have a production incident.
The most reliable way to match schemas is to take a schema dump from production and restore it into your test environment. This gives you the exact table structures, indexes, constraints, and data types that exist in production. You are no longer guessing whether your test schema matches reality.
Here is a concrete example using PostgreSQL:
# Dump only the schema (no data) from production
pg_dump --schema-only --no-owner --no-acl production_db > schema.sql
# Restore the schema into your test database
psql test_db < schema.sql
Data That Exposes Real Problems
An empty table or a table with random test data will not reveal edge cases. Migration failures often come from data that exists in production but not in your test set.
If your migration changes a column type, your test data should include edge values: long strings, decimal numbers with many digits, NULL values in that column. If your migration adds a unique constraint, your test data should include duplicate rows that violate that constraint. If your migration removes a column, your test should exercise the queries your application still runs against that column.
You do not need to copy the full production dataset. For a table with millions of rows, a few thousand well-chosen rows are enough to make the database engine produce a realistic execution plan. The goal is not to replicate production volume. The goal is to replicate the data patterns that could cause the migration to behave differently.
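The pattern-seeding idea can be sketched with pre-flight queries that hunt for exactly the rows that would break the migration. This is a minimal sketch using Python's standard sqlite3 module as a stand-in for the real database; the `users` table and `email` column are hypothetical names chosen for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, plan TEXT);
    -- Seed the patterns that break migrations: duplicates and NULLs.
    INSERT INTO users (email, plan) VALUES
        ('a@example.com', 'free'),
        ('a@example.com', 'pro'),     -- duplicate email
        (NULL,            'free');    -- NULL in a soon-to-be NOT NULL column
""")

# Pre-flight check 1: rows that would violate a new NOT NULL constraint.
null_emails = conn.execute(
    "SELECT COUNT(*) FROM users WHERE email IS NULL").fetchone()[0]

# Pre-flight check 2: values that would violate a new UNIQUE constraint.
dup_emails = conn.execute(
    "SELECT COUNT(*) FROM (SELECT email FROM users "
    "WHERE email IS NOT NULL GROUP BY email HAVING COUNT(*) > 1)"
).fetchone()[0]
```

Both counts come back non-zero here, which is the point: the test data catches the failure before the migration does.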
Dry-Run as a Safety Net
A dry-run executes the migration SQL without permanently changing the database. Some migration tools have a built-in dry-run mode. If yours does not, you can wrap the migration in a transaction and roll it back at the end.
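The transaction-wrapping approach can be sketched as follows, again with sqlite3 standing in for the real database. PostgreSQL and SQLite run DDL transactionally, so BEGIN plus ROLLBACK works there; MySQL implicitly commits around most DDL, so this trick does not apply to it. The `users` table is a hypothetical example.

```python
import sqlite3

def dry_run(conn, migration_statements):
    """Execute migration SQL inside a transaction, inspect, then roll back."""
    cur = conn.cursor()
    cur.execute("BEGIN")
    try:
        for stmt in migration_statements:
            cur.execute(stmt)
        # Inspect the would-be schema while the transaction is still open.
        return [row[1] for row in cur.execute("PRAGMA table_info(users)")]
    finally:
        conn.rollback()  # discard every change, success or failure

conn = sqlite3.connect(":memory:")
conn.isolation_level = None  # autocommit mode; transactions managed manually
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")

cols_during = dry_run(conn, ["ALTER TABLE users ADD COLUMN created_at TEXT"])
cols_after = [row[1] for row in conn.execute("PRAGMA table_info(users)")]
```

Inside the transaction the new column is visible, so you can inspect it; after the rollback the schema is untouched. The `finally` block matters: the rollback runs whether the migration succeeds or throws.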
The purpose of a dry-run is not to confirm that the migration succeeds. It is to see warnings, errors, and execution plan changes that might indicate problems. A migration that runs fine on a small table might show a full table scan warning when run against a realistic schema. An ALTER TABLE on a large table might show an estimated execution time that is unacceptable for your maintenance window.
After the dry-run, review the output manually or through automated checks. Look for operations that could lock tables for extended periods. Look for queries that might degrade performance. Look for any unexpected behavior that did not appear during local testing.
Simulating Light Load During Migration
Some migrations are safe on an idle database but cause problems under real traffic. An ALTER TABLE that acquires a lock might complete in seconds when no one else is using the table. Under production load, that same lock could cause query timeouts and application errors.
You do not need a full load test. A simple script that runs SELECT and INSERT queries against the affected tables during the migration is enough. If those simulated queries fail or time out, you have found a problem that would have hit production users.
Run this simulation during the dry-run phase. If the migration causes deadlocks or excessive waits under light load, you need to reconsider your approach. Maybe you need to use a locking strategy that is less disruptive. Maybe you need to break the migration into smaller steps. Maybe you need to schedule it during a lower-traffic window.
Automating the Test Environment
Manually setting up a test environment for every migration is slow and error-prone. The better approach is to automate it as part of your CI pipeline.
When a new migration is committed, the pipeline creates a test environment from a production schema snapshot. It loads the relevant test data. It runs the dry-run. It simulates light load. Then it destroys the environment.
This automation ensures every migration is tested against a consistent baseline. No one can skip the test because they were in a hurry. No one can claim the test passed because they used a different schema. The pipeline enforces the same conditions every time.
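The pipeline's shape can be sketched as a driver that runs each stage in order and always tears the environment down, even when a stage fails. Every step here is a stub standing in for the real command (the schema restore, the data loader, your migration tool's dry-run, the load script); the function and migration names are hypothetical.

```python
def run_migration_pipeline(migration, steps_log):
    """Sketch of the CI flow: build the environment, test, always tear down."""
    def create_env():      steps_log.append("create env from schema snapshot")
    def load_test_data():  steps_log.append("load edge-case test data")
    def dry_run():         steps_log.append(f"dry-run {migration}")
    def simulate_load():   steps_log.append("simulate light load")
    def destroy_env():     steps_log.append("destroy env")

    try:
        create_env()
        load_test_data()
        dry_run()
        simulate_load()
    finally:
        destroy_env()  # tear down even when a step above fails

log = []
run_migration_pipeline("0042_add_created_at.sql", log)
```

The try/finally teardown is the design choice worth copying: a dry-run that crashes must not leave a stale test environment behind, or the next migration gets tested against the wrong baseline.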
A Practical Checklist
Before running a migration in production, confirm these conditions are met:
- Test schema matches production schema exactly
- Test data includes edge cases relevant to the migration
- Dry-run completed without unexpected warnings or errors
- Light load simulation did not cause failures or timeouts
- Test environment was created from a recent production snapshot
What This Means for Your Pipeline
Database migrations are not like application code. You cannot just deploy and observe. A failed migration can corrupt data, lock tables, and cause extended downtime. The cost of a bad migration is much higher than the cost of a bad application release.
Testing migrations in a realistic environment is not optional. It is the difference between knowing your migration will work and hoping it will work. The environment you build for testing does not need to be expensive or complex. It needs to be representative.
Start with a schema dump from production. Add targeted test data. Run dry-runs. Simulate light load. Automate the whole process. Your production database will thank you.