23-7 · Chapter 23 · 7 min read

When Database Migrations Go Wrong: Rollback vs Roll-Forward

Your team just ran a database migration in production. Five minutes later, the monitoring dashboard turns red. Error rates spike. Users start reporting issues. Now you have to make a decision fast: do you undo the change, or do you push another fix forward?

This moment separates teams who have a plan from teams who panic. And the answer is never as simple as "just roll back." The type of change you made, the data involved, and the state of your system all determine which path is safer.

The Two Paths for Recovery

There are fundamentally two ways to recover from a bad database migration. They work differently, carry different risks, and apply to different situations.

Rollback means reversing the migration you just ran. You execute the down migration, which is written to undo what the up migration did. If you added a column, the down migration drops it. If you changed a data type, the down migration changes it back.

Roll-forward means leaving the problematic migration in place and writing a new migration that fixes the issue. You don't go backward. You move forward with a correction.

Both strategies have their place. The trick is knowing which one fits your situation before you need it.

When to Rollback

Rollback works best for changes that are safe to reverse. These are usually non-destructive operations where undoing the change doesn't cause data loss or corruption.

Good candidates for rollback include:

Adding a nullable column
Creating a new index
Adding a new table
Creating a view or function

These changes are additive. When you reverse them, you remove something that was added. No existing data gets lost or corrupted in the process.

Consider a scenario where you added a last_login_at column to your users table. The column is nullable, so existing rows are fine. After deployment, you discover the application code has a bug that writes incorrect timestamps. Rolling back by dropping the column is safe. No data is harmed because the column was empty or contained data you don't need to preserve.

When to Roll-Forward

Roll-forward becomes the better choice when the migration is destructive or when reversing it would cause more harm than the original problem.

Situations where roll-forward is safer:

Dropping a column or table
Changing a data type in a way that loses precision
Modifying existing data values at scale
Merging tables together
Removing a NOT NULL constraint that other systems depend on

Imagine you ran a migration that dropped a legacy_status column. The data in that column is gone. Writing a down migration that adds the column back won't restore the data. Users who depended on that status field are now seeing null values. Your best move is to write a new migration that recreates the column and populates it from a backup or from application logs.
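
As a rough sketch of that roll-forward, the snippet below recreates the column and refills it from a backup copy. It runs against an in-memory SQLite database through Python's sqlite3 module so it can be executed as-is. The users_backup table, the sample rows, and the function names are assumptions made for illustration, not part of any real schema.

```python
import sqlite3


def restore_legacy_status(conn):
    """Roll-forward fix: recreate the dropped column and refill it from a backup.

    Returns the number of rows that still have no value after the backfill.
    """
    with conn:  # commits the pending transaction, or rolls it back on error
        conn.execute("ALTER TABLE users ADD COLUMN legacy_status TEXT")
        conn.execute(
            """
            UPDATE users
            SET legacy_status = (
                SELECT b.legacy_status FROM users_backup AS b WHERE b.id = users.id
            )
            WHERE id IN (SELECT id FROM users_backup)
            """
        )
    return conn.execute(
        "SELECT COUNT(*) FROM users WHERE legacy_status IS NULL"
    ).fetchone()[0]


def demo():
    """Build a tiny example database (hypothetical data) and run the fix."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE users_backup (id INTEGER PRIMARY KEY, legacy_status TEXT);
        INSERT INTO users VALUES (1, 'ada'), (2, 'bob'), (3, 'cy');
        INSERT INTO users_backup VALUES (1, 'active'), (2, 'suspended');
        """
    )
    missing = restore_legacy_status(conn)
    rows = conn.execute("SELECT id, legacy_status FROM users ORDER BY id").fetchall()
    return missing, rows
```

Any row counted as missing, such as user 3 in this sample, has no entry in the backup and needs another source, for example application logs.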

Another common case: you changed a column from VARCHAR to INTEGER, converting string values to numbers. Rolling back by changing the type back to VARCHAR is risky because the integer values might not convert cleanly back to strings. A value of 42 becomes "42", but what about values that were truncated or rounded during the conversion? You've lost information. Roll-forward lets you write a careful migration that handles these edge cases explicitly.
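
To see how quickly information disappears, here is a small sketch that round-trips a few strings through an integer conversion. The to_integer rule (trim whitespace, parse, truncate the fraction) is an assumed stand-in for whatever cast your up migration actually applied, and the sample values are made up.

```python
def to_integer(raw):
    """Assumed conversion rule for the up migration: trim, parse, truncate."""
    return int(float(raw.strip()))


def lossy_values(values):
    """Return (original, round_tripped) pairs that differ after str -> int -> str."""
    changed = []
    for raw in values:
        restored = str(to_integer(raw))
        if restored != raw:
            changed.append((raw, restored))
    return changed


samples = ["42", "007", "19.99", " 5 ", "-3"]
print(lossy_values(samples))
# [('007', '7'), ('19.99', '19'), (' 5 ', '5')]
```

Running a check like this against a copy of the real column before the migration tells you which rows a rollback could never restore.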

Writing Down Migrations That Actually Work

If you choose to support rollback, your down migrations need real care. They cannot be mechanical reversals of the up migration. Every down migration must account for the data that exists at the moment of rollback.

Here is what makes a down migration dangerous:

It assumes the data is in the same state as when the up migration ran
It ignores rows that were added or modified after the up migration
It blindly reverses schema changes without checking data integrity

Consider a concrete example. You added a nullable column last_login_at to the users table, but the application code has a bug that writes incorrect timestamps. A safe down migration and a roll-forward fix would look like this:

-- Safe down migration: drop the column, but only after checking it's safe
-- (IF EXISTS / IF NOT EXISTS below assume PostgreSQL syntax)
BEGIN;

-- Step 1: Verify no application code or views depend on this column
-- (This check is done in the deployment pipeline, not in SQL)

-- Step 2: Drop the column
ALTER TABLE users DROP COLUMN IF EXISTS last_login_at;

COMMIT;

-- Roll-forward migration: keep the column and repair the data in place
BEGIN;

-- The column should already exist; IF NOT EXISTS makes this a no-op in that case
ALTER TABLE users ADD COLUMN IF NOT EXISTS last_login_at TIMESTAMP;

-- Optionally backfill from application logs or a backup
-- UPDATE users SET last_login_at = ... WHERE id IN (...);

COMMIT;

The down migration is safe because the column is nullable and additive. The roll-forward migration fixes the issue without reversing the schema change.

A safe down migration for adding a NOT NULL column with a default value should:

Verify that dropping the column won't break application queries
Handle any rows that were inserted after the column was added
Ensure no foreign key relationships depend on the column
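
The first of these checks can be partly automated. The sketch below shows one heuristic way to do it in SQLite through Python's sqlite3 module: it scans the stored definitions of views, indexes, and triggers for the column name. The function name, the sample schema, and the text-matching approach are assumptions for illustration. A plain substring match can report false positives, and other databases expose dependency information differently.

```python
import sqlite3


def objects_mentioning_column(conn, table, column):
    """Names of views, indexes, and triggers whose stored SQL mentions the column.

    Heuristic only: it matches plain substrings, so expect false positives.
    """
    rows = conn.execute(
        "SELECT name, sql FROM sqlite_master "
        "WHERE type IN ('view', 'index', 'trigger') AND sql IS NOT NULL"
    ).fetchall()
    return sorted(name for name, sql in rows if table in sql and column in sql)


# Hypothetical schema: one index on another column, one view on the new column.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, last_login_at TEXT);
    CREATE INDEX idx_users_name ON users (name);
    CREATE VIEW recent_logins AS SELECT id, last_login_at FROM users;
    """
)
blockers = objects_mentioning_column(conn, "users", "last_login_at")
print(blockers)  # ['recent_logins']
```

A non-empty result is a reason to stop and look before the down migration drops the column.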

For a data type change from VARCHAR to INTEGER, the down migration needs to handle values that don't have a clean string representation. You might need to cast integers back to strings, but also handle NULLs and edge cases that the original string values didn't have.

The Real Risks You Cannot Ignore

Rollback sounds simple, but it carries serious risks that teams discover only after something goes wrong.

Data loss is the biggest risk. When a migration drops a column, the data is gone. No down migration can bring it back unless you have a backup. If you didn't take a backup before the migration, rollback means accepting permanent data loss.
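
One lightweight safeguard is to snapshot the affected table right before a destructive migration. The following is a minimal sketch using SQLite through Python's sqlite3 module; the snapshot_table helper and the table names in the usage note are hypothetical. CREATE TABLE ... AS SELECT copies rows but not indexes or constraints, so treat it as a complement to a real backup, not a replacement.

```python
import sqlite3


def snapshot_table(conn, table, snapshot):
    """Copy a table's rows into a snapshot table and verify the row counts match."""
    for name in (table, snapshot):
        # Identifiers cannot be bound as SQL parameters, so validate them first.
        if not name.isidentifier():
            raise ValueError(f"unsafe table name: {name!r}")
    with conn:
        conn.execute(f'CREATE TABLE "{snapshot}" AS SELECT * FROM "{table}"')
    original = conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]
    copied = conn.execute(f'SELECT COUNT(*) FROM "{snapshot}"').fetchone()[0]
    if original != copied:
        raise RuntimeError("snapshot is incomplete; do not run the migration")
    return copied
```

A call such as snapshot_table(conn, "users", "users_pre_drop") would run as the first step of the deployment, before the up migration touches the table.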

Migration dependencies create hidden traps. If migration two depends on a column added by migration one, rolling back migration one without first reverting migration two leaves the schema in a state neither migration expects. Your application might crash because it expects columns that no longer exist. Your data might become inconsistent because rows reference values that were removed.
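
To make the dependency trap concrete, here is a small sketch that works out which migrations have to be reverted, and in what order, before a given one can be rolled back. The migration names and the depends_on mapping are invented for the example, and the function assumes that applied is listed in the order the migrations ran and that each migration only depends on earlier ones.

```python
def rollback_plan(applied, depends_on, target):
    """Return the migrations to revert, newest first, so `target` can be rolled back."""
    if target not in applied:
        raise ValueError(f"{target} has not been applied")
    to_revert = {target}
    # Walk history in order: anything built on a reverted migration must go too.
    for name in applied:
        if depends_on.get(name, set()) & to_revert:
            to_revert.add(name)
    return [name for name in reversed(applied) if name in to_revert]


applied = [
    "001_add_last_login",
    "002_index_last_login",
    "003_add_orders",
    "004_login_report_view",
]
depends_on = {
    "002_index_last_login": {"001_add_last_login"},
    "004_login_report_view": {"002_index_last_login"},
}
print(rollback_plan(applied, depends_on, "001_add_last_login"))
# ['004_login_report_view', '002_index_last_login', '001_add_last_login']
```

Rolling back one migration here really means rolling back three, which is worth knowing before the incident rather than during it.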

Roll-forward has its own risks. The biggest one is time. You need to write a new migration, get it through the pipeline, and deploy it. During that time, your application is running with the broken state. Users are experiencing errors. Your team is under pressure to fix it fast, which increases the chance of making another mistake.

Roll-forward also requires accurate knowledge of the current database state. You cannot write the fix based on assumptions. You need to know exactly what the data looks like right now, not what it looked like when the migration was designed.

Making the Decision Before You Need It

The worst time to decide between rollback and roll-forward is when production is on fire. By then, you are stressed, the clock is ticking, and your judgment is compromised.

A better approach is to classify every migration before it runs. Assign each migration a recovery category:

Safe to rollback: Additive changes like new columns, tables, or indexes
Requires backup before rollback: Changes that modify existing data or drop nullable columns
Roll-forward only: Destructive changes like dropping columns, changing data types, or merging tables

Document this classification in your migration files or in your deployment runbook. When something goes wrong, your team reads the classification and executes the predefined strategy. No debate. No second-guessing.
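
One way to keep the classification next to the code is a small lookup like the sketch below. The operation labels are hypothetical; a real project would derive them from its own migration tool. Treating unrecognized operations as roll-forward only is a cautious default chosen for this example, not a rule from any particular framework.

```python
from enum import Enum


class Recovery(Enum):
    SAFE_TO_ROLLBACK = "safe to rollback"
    BACKUP_THEN_ROLLBACK = "requires backup before rollback"
    ROLL_FORWARD_ONLY = "roll-forward only"


# Hypothetical operation labels, grouped by the three categories above.
ADDITIVE = {"add_nullable_column", "add_table", "add_index", "add_view"}
DATA_CHANGING = {"update_rows", "drop_nullable_column"}
DESTRUCTIVE = {"drop_column", "drop_table", "change_type", "merge_tables"}


def classify(operations):
    """Pick the most restrictive category among a migration's operations."""
    unknown = operations - ADDITIVE - DATA_CHANGING - DESTRUCTIVE
    if unknown or operations & DESTRUCTIVE:
        return Recovery.ROLL_FORWARD_ONLY  # unknown operations get the cautious default
    if operations & DATA_CHANGING:
        return Recovery.BACKUP_THEN_ROLLBACK
    return Recovery.SAFE_TO_ROLLBACK


print(classify({"add_index", "update_rows"}).value)
# requires backup before rollback
```

The returned value can be written into the migration file header or the runbook entry, so the category is decided at review time.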

A Quick Decision Checklist

Before you run any migration in production, ask these questions:

Is the change additive or destructive?
Can the down migration restore the exact previous state, including data?
Do you have a verified backup taken before the migration?
Has the down migration been tested in a staging environment?
Does this migration depend on other migrations that ran before it?
What is the cost of downtime while you write a roll-forward fix?

If you cannot answer all of these, do not run the migration in production yet.

The following decision tree can help you apply the checklist under pressure:

flowchart TD
    A[Migration fails] --> B{Change additive?}
    B -- Yes --> C[Consider rollback]
    B -- No --> D[Choose roll-forward]
    C --> E{Data loss risk?}
    E -- Low --> F[Rollback safe]
    E -- High --> G[Roll-forward instead]
    D --> H{Code-schema sync?}
    H -- Yes --> I[Write fix migration]
    H -- No --> J[Fix code first, then migrate]
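
The decision tree above is small enough to encode directly, which makes it easy to put into a runbook script. The sketch below mirrors its branches one to one; the function name, parameter names, and returned strings are illustrative choices.

```python
def recovery_action(additive, high_data_loss_risk, code_matches_schema):
    """Walk the decision tree and return the suggested next step."""
    if additive:
        # Additive change: rollback is an option unless data would be lost.
        return "roll forward" if high_data_loss_risk else "rollback"
    # Non-additive change: roll forward, fixing the code first if it is out of sync.
    if code_matches_schema:
        return "write fix migration"
    return "fix code first, then migrate"


print(recovery_action(additive=True, high_data_loss_risk=False, code_matches_schema=True))
# rollback
```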

The Concrete Takeaway

Rollback and roll-forward are not interchangeable strategies. They apply to different types of changes and carry different risks. The teams that handle database incidents well are not the ones who are fastest at typing SQL. They are the ones who thought about recovery before the migration ran. They classified their changes, tested their down migrations, and had a backup ready. When the dashboard turned red, they did not panic. They executed the plan they already made.