When Database Migrations Go Wrong: Why Rolling Forward Beats Rolling Back
You just deployed a database migration that added a NOT NULL phone_number column to the users table. The migration ran successfully. Then your team realized the application code that uses this column hasn't been deployed yet. Every new user registration now fails because the old code runs an INSERT that supplies no value for the new column, and the NOT NULL constraint rejects the row.
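To make the failure concrete, here is a minimal sketch in PostgreSQL syntax. Only users and phone_number come from the scenario; the starting schema, the email column, the VARCHAR(20) type, and the sample value are assumptions for illustration.

```sql
-- Hypothetical starting schema, before the migration.
CREATE TABLE users (
    id    BIGSERIAL PRIMARY KEY,
    email TEXT NOT NULL
);

-- version_001 (assumed form): add the column as required, with no default.
-- PostgreSQL accepts this only while users has no rows; on a populated
-- table the migration would have to backfill existing rows first.
ALTER TABLE users
ADD COLUMN phone_number VARCHAR(20) NOT NULL;

-- What the old application code still runs on registration.
-- It names only the columns it knows about.
INSERT INTO users (email) VALUES ('ada@example.com');
-- Fails with a not-null constraint violation on phone_number.
```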
Your production system is broken. What do you do?
Most teams instinctively reach for the down migration -- the script that reverses the change and removes the column. But that instinct can cause more damage than the original problem. There's a better approach: roll forward.
The Problem with Down Migrations
Down migrations look clean on paper. You write an up migration that adds a column, and a down migration that removes it. If something goes wrong, you run the down migration and everything returns to how it was.
In practice, down migrations are dangerous for several reasons.
First, data loss is likely. Any values written to the new column after the up migration ran disappear when you drop the column. You might lose customer data that can't be recovered without restoring a backup.
Second, down migrations create a mismatch between your application code and your database schema. If your application code already expects the new column to exist, removing it causes runtime errors. You'd need to deploy the old application code too, which means coordinating multiple rollbacks simultaneously.
Third, down migrations are rarely tested. Teams write them as an afterthought, often with bugs that only surface during an actual emergency. Running an untested script against production during an outage is a recipe for more downtime.
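The first point is easy to check on a scratch database. The sketch below uses PostgreSQL syntax and a hypothetical users_demo table with made-up values, so it can be run without touching a real users table.

```sql
-- Scratch table for the demonstration (hypothetical columns and values).
CREATE TABLE users_demo (
    id           SERIAL PRIMARY KEY,
    email        TEXT NOT NULL,
    phone_number VARCHAR(20)
);

INSERT INTO users_demo (email, phone_number)
VALUES ('ada@example.com', '555-0100');

-- The down migration: drop the column.
ALTER TABLE users_demo DROP COLUMN phone_number;

-- Re-running the up migration brings back the column, not the data.
ALTER TABLE users_demo ADD COLUMN phone_number VARCHAR(20);

SELECT email, phone_number FROM users_demo;
-- phone_number is NULL for the existing row; '555-0100' is gone.
```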
What Is Roll-Forward?
Roll-forward is a strategy where you never reverse a migration. Instead, when a migration causes problems, you write a new migration that fixes the issue. The database moves forward to a corrected state, not backward to a previous state.
Using our earlier example, instead of running a down migration to remove phone_number, you write a new migration that makes the column nullable or adds a default value. The column stays, but the constraint that caused the failures is removed. New user registrations work again, and any data already stored in phone_number is preserved.
Here's what that fix migration looks like in SQL (PostgreSQL syntax; other databases spell this statement differently):
-- version_002: fix phone_number constraint
-- This migration makes phone_number nullable so old application code
-- can insert rows without providing a value.
ALTER TABLE users
ALTER COLUMN phone_number DROP NOT NULL;
Each migration becomes a cumulative change. The first migration added the column. The second migration fixed the column's constraint. The migration tracker records both changes in sequence, so you can see that version_002 corrected version_001 without erasing its history.
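One way to picture the tracker is as a small table with one row per applied migration. The sketch below is hand-rolled for illustration: the schema_migrations table, its columns, and the description strings are assumptions, and real migration tools create and manage their own tracking tables. It assumes the users table already exists with version_001 applied.

```sql
-- Hypothetical tracker table: one row per applied migration.
CREATE TABLE schema_migrations (
    version     TEXT PRIMARY KEY,
    description TEXT NOT NULL,
    applied_at  TIMESTAMPTZ NOT NULL DEFAULT now()
);

-- Recorded when version_001 ran.
INSERT INTO schema_migrations (version, description)
VALUES ('version_001', 'add phone_number to users');

-- The fix runs in one transaction together with its tracker row,
-- so the schema change and its record succeed or fail as a unit.
BEGIN;
ALTER TABLE users
ALTER COLUMN phone_number DROP NOT NULL;
INSERT INTO schema_migrations (version, description)
VALUES ('version_002', 'fix version_001: make phone_number nullable');
COMMIT;

-- Read the history in the order it was applied.
SELECT version, description, applied_at
FROM schema_migrations
ORDER BY applied_at, version;
```

Nothing in this history is ever updated or deleted. A correction shows up as one more row, which is what keeps the record of version_001 intact.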
Why Teams Prefer Roll-Forward
The biggest advantage is that the fix itself destroys no data. If 500 phone numbers were stored after the first migration ran and before you discovered the problem, those numbers survive the fix. A down migration would have deleted them permanently.
Roll-forward also keeps your application code and database schema aligned. Since the column still exists, any application code that reads phone_number continues to work. You don't need to coordinate a simultaneous code rollback. You fix the database, then fix the application code at your own pace.
This approach mirrors how teams handle bugs in application code. When a bug reaches production, you don't revert the entire codebase to last week's version. You push a fix. Database migrations should work the same way: the fix is a new migration, not a reversal.
When Roll-Forward Gets Complex
Not every roll-forward fix is as simple as changing a column constraint. Consider a migration that changed a column's data type from VARCHAR to INTEGER. If the conversion truncated or corrupted existing data, your fix migration might need to:
- Add a new column with the original data type
- Copy data from the corrupted column, applying transformations to recover values
- Update application code references to use the new column
- Drop the corrupted column in a later migration
This is more work than a simple down migration, but it is also safer. You can run the fix migration in a staging environment first, verify the data recovery logic, and only then apply it to production. A down migration offers no such recovery path: converting the column back to VARCHAR does not restore values the first conversion already damaged.
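The four steps listed above can be sketched for one recoverable case. Assume a hypothetical addresses table whose postal_code column held five-digit codes as VARCHAR, and that the conversion to INTEGER dropped leading zeros (so '02134' became 2134). The table, the column names, and the five-digit assumption are illustrative and not from a real schema. PostgreSQL syntax.

```sql
-- Before copying, confirm the assumption holds: no value outside 0..99999.
SELECT count(*) AS out_of_range
FROM addresses
WHERE postal_code NOT BETWEEN 0 AND 99999;

-- Step 1: add a new column with the original data type.
ALTER TABLE addresses
ADD COLUMN postal_code_text VARCHAR(5);

-- Step 2: copy from the damaged column, restoring the leading zeros
-- that the INTEGER conversion dropped (2134 becomes '02134').
UPDATE addresses
SET postal_code_text = lpad(postal_code::text, 5, '0')
WHERE postal_code IS NOT NULL;

-- Step 3 happens in application code: read and write postal_code_text.

-- Step 4 belongs in a later migration, once nothing references the old column:
-- ALTER TABLE addresses DROP COLUMN postal_code;
```

The range check matters because lpad truncates input longer than the target length, so an out-of-range value would be silently cut to five characters instead of raising an error.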
The key insight is that roll-forward doesn't require you to predict every possible failure before deployment. You just need confidence that if something goes wrong, your team can write a fix. This shifts the risk from prevention (which is impossible to perfect) to recovery (which is a skill you can practice).
Practical Checklist for Roll-Forward
Before you commit to roll-forward as your team's strategy, make sure these practices are in place:
- Know how each migration could be reversed, even if you never script it. Understand what a down migration would do, but don't write one unless you have a specific reason to need it.
- Test your fix migrations in staging first. Run the original migration, introduce the failure scenario, then apply the fix migration. Verify data integrity afterward.
- Keep your migration tracker reliable. Never modify or delete migration records manually. The tracker is your audit trail for understanding what changed and when.
- Document the failure mode. When you write a fix migration, add a comment explaining what went wrong and why the fix works. This helps future team members who encounter similar patterns.
- Practice roll-forward scenarios. Run a quarterly drill where someone deliberately introduces a bad migration, and the team practices writing and deploying a fix under time pressure.
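The staging test from the checklist can be sketched as a before-and-after comparison. This is one possible shape, not a prescribed procedure: it reuses the phone_number scenario, assumes users has id and email columns, and uses a hypothetical snapshot table named users_before_fix. PostgreSQL syntax.

```sql
-- Run against staging, never production.

-- 1. Snapshot the rows that exist after the original migration.
CREATE TABLE users_before_fix AS
SELECT id, phone_number FROM users;

-- 2. Apply the fix migration under test.
ALTER TABLE users
ALTER COLUMN phone_number DROP NOT NULL;

-- 3. Replay the failure scenario: the old-style insert should now succeed.
INSERT INTO users (email) VALUES ('drill@example.com');

-- 4. Verify integrity: every pre-existing value must be unchanged.
SELECT count(*) AS changed_or_missing
FROM users_before_fix b
LEFT JOIN users u ON u.id = b.id
WHERE u.id IS NULL
   OR u.phone_number IS DISTINCT FROM b.phone_number;
-- A result of 0 means the fix preserved the existing data.

-- 5. Clean up the snapshot.
DROP TABLE users_before_fix;
```

IS DISTINCT FROM is used instead of <> so that a value changing to or from NULL also counts as a difference.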
The Takeaway
Down migrations are a trap. They promise simplicity but deliver data loss, coordination headaches, and untested emergency scripts. Roll-forward treats database changes like code changes: when something breaks, you fix it forward, not backward.
The next time a migration goes wrong, resist the urge to reverse. Write a fix migration instead. Your data, your application, and your team's sanity will thank you.