When Database Migrations Fail in Production: Three Scenarios That Will Keep You Up at Night

You just ran a migration in production. It completed successfully. No errors, no timeouts, no locked tables. You breathe a sigh of relief and move on to the next task.

Two hours later, your phone buzzes. A report is broken. Data looks wrong. A service you forgot about is writing nulls into critical tables. The migration succeeded, but your production system is falling apart.

This is the nightmare of database migrations. Unlike application deployments where failures are usually immediate and obvious, migration failures can hide for hours or days. By the time you notice, the damage has already spread.

Let me show you three real scenarios where migrations go wrong, not because the SQL failed, but because the side effects caught everyone off guard.

Scenario One: The New Column That Broke Everything

Your team needs to add a phone_number column to the users table. The migration runs fine in staging. All tests pass. You push to production with confidence.

The column gets created. No errors. But seconds later, the application starts behaving strangely.

Here's what happened: the production application hasn't been updated yet. The old code is still running, and it sends queries like SELECT * FROM users. Those still work - the new column is simply ignored. The real problem is elsewhere. Another code path starts inserting data into phone_number, but in a different format than the new application expects. Phone numbers arrive mixed: some with country codes, some without; some with dashes, some without.

Consider the migration that triggered this scenario:

```sql
ALTER TABLE users ADD COLUMN phone_number VARCHAR(20);
```

This looks harmless. But nothing about it constrains the data: without a CHECK constraint there is no format rule, and without NOT NULL or a default the column happily holds gaps. Worse, on some engines and versions (older MySQL, for example), this ALTER TABLE rewrites the table and blocks writes for the entire operation. In production, that lock can queue up hundreds of requests in seconds. The real danger isn't the SQL itself - it's that the schema changed before every application instance was ready to handle it.
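
Had the column shipped with a format rule, the mixed-format writes would have failed loudly instead of landing as corrupt data. Here's a minimal sketch of that idea, assuming PostgreSQL and an E.164-style format - both assumptions, not part of the original incident:

```sql
-- Assumes PostgreSQL. Adding a nullable column with no default is a
-- fast, metadata-only change here; the format rule is enforced by a
-- CHECK constraint instead of hoping every writer agrees.
ALTER TABLE users ADD COLUMN phone_number VARCHAR(20);

-- NOT VALID: enforce the format for new writes without scanning
-- existing rows under a long lock.
ALTER TABLE users
    ADD CONSTRAINT users_phone_number_e164
    CHECK (phone_number ~ '^\+[1-9][0-9]{1,14}$') NOT VALID;

-- Validate separately; this scans the table but takes a lighter
-- lock that still allows reads and writes.
ALTER TABLE users VALIDATE CONSTRAINT users_phone_number_e164;
```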

Now you have inconsistent data in a column that multiple systems will depend on. Your team faces an ugly choice: try to clean up the existing data, or rush the new application version into production before it's ready.

The core issue here is timing. The schema changed before the application code that understands it was fully deployed. In a distributed system, not every instance updates at the same instant. For a brief window - sometimes longer - old code interacts with new schema.

Scenario Two: The Type Change That Broke a Nightly Report

This one is more subtle. Your team decides to change the price column from integer to decimal. Good idea - prices need precision. The migration runs perfectly. No immediate errors. The application seems happy.

But six months ago, someone wrote a report query that treats price as an integer. That query isn't used on the main pages. It runs once a night for a financial report. At 2 AM, the report fails. Queries that compare the new decimal column against integer values start throwing type mismatch errors, and arithmetic that relied on integer division quietly changes meaning.
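
Whether this surfaces as a hard error or as silently different numbers depends on the engine and driver. A small illustration of the semantic shift, in PostgreSQL syntax:

```sql
-- Before the migration: price is INTEGER, so the report's
-- division truncates to whole dollars.
SELECT 1999 / 100 AS dollars;            -- 19

-- After the migration: price is NUMERIC, and the identical
-- expression suddenly yields fractional results - or a bind-time
-- type error in a strictly typed data-access layer.
SELECT 1999::NUMERIC / 100 AS dollars;   -- 19.99
```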

This is what engineers call a delayed breaking change. The migration didn't break anything visible during the day, but it silently broke a critical batch process that runs at night. By morning, the finance team is asking why yesterday's numbers don't add up.

The dangerous part? You might not discover this failure for hours. And the fix isn't simple. You can't just revert the type change without another migration, which carries its own risks.
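
One way to avoid both the 2 AM failure and the risky revert is the expand-and-contract pattern: add the new column next to the old one, move every consumer over, and only then remove the original. A rough sketch, using a hypothetical products table:

```sql
-- Expand: add the new column without touching the old one.
ALTER TABLE products ADD COLUMN price_decimal NUMERIC(10, 2);

-- Backfill (in batches for large tables, not one statement).
UPDATE products SET price_decimal = price WHERE price_decimal IS NULL;

-- ...deploy application code that writes both columns and reads the
-- new one; migrate the nightly report; wait at least one cycle...

-- Contract: only after every consumer has moved.
ALTER TABLE products DROP COLUMN price;
ALTER TABLE products RENAME COLUMN price_decimal TO price;
```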

Scenario Three: The Deleted Column That Poisoned Three Tables

This is the most dangerous scenario. Your team is confident that the old_status column is no longer used. It was deprecated months ago. Nobody references it in the main application. You write a migration to drop it.

The migration runs smoothly. The column disappears. No errors anywhere.

But there's a background service - a data sync job written by a team that left the company two years ago - that still reads old_status periodically. It doesn't crash when the column is missing. It just starts writing NULL values into other tables. The nulls propagate. Data integrity breaks silently across three different tables over the next two hours.
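
A cheap guardrail against exactly this failure is a scheduled NULL-rate check on the columns such jobs feed. A sketch with hypothetical table and column names (the real ones depend on your schema):

```sql
-- Hypothetical names: order_snapshots / customer_status stand in
-- for whatever tables the sync job writes to. Run on a schedule
-- and alert when the recent NULL rate jumps above its baseline.
SELECT
    COUNT(*) FILTER (WHERE customer_status IS NULL) AS null_rows,
    COUNT(*)                                        AS total_rows
FROM order_snapshots
WHERE created_at > now() - interval '1 hour';
```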

By the time someone notices, the damage is done. You can't just "undo" the column deletion. The data in those other tables is already corrupted. Recovery requires understanding exactly which rows were affected, reconstructing the missing values from backups, and running careful repair scripts.
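
A lower-risk way to retire a column is to rename it first and drop it later, so forgotten consumers surface errors while the data still exists. A sketch, assuming the column lives on an orders table:

```sql
-- Step 1: rename. Any consumer still reading old_status now hits a
-- "column does not exist" error instead of quietly getting nothing.
ALTER TABLE orders RENAME COLUMN old_status TO old_status_deprecated;

-- Step 2: watch error logs for at least one full release cycle,
-- including nightly and monthly jobs.

-- Step 3: only then drop for real.
ALTER TABLE orders DROP COLUMN old_status_deprecated;
```

It's not a silver bullet - the sync job in this story swallowed its errors - but a rename is instantly reversible, and a drop is not.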

Why Database Migrations Are Different From Application Deployments

These three scenarios share a common pattern: the migration executed successfully, and the damage appeared later. That delay is what makes database migrations fundamentally different from application deployments. The flowchart below maps each scenario from root cause to consequence.

```mermaid
flowchart TD
    subgraph Scenario1["Scenario 1: New Column"]
        A1[Add phone_number column] --> B1[Old code writes inconsistent formats]
        B1 --> C1[Data corruption in new column]
    end
    subgraph Scenario2["Scenario 2: Type Change"]
        A2[Change price from integer to decimal] --> B2[Nightly report uses integer comparison]
        B2 --> C2[Report fails at 2 AM]
    end
    subgraph Scenario3["Scenario 3: Deleted Column"]
        A3[Drop old_status column] --> B3[Background service writes NULLs]
        B3 --> C3[NULLs propagate to 3 tables]
        C3 --> D3[Data integrity broken]
    end
```

When an application deployment fails, you usually know immediately. Errors appear in logs. Users report problems. Monitoring alerts fire. You can roll back the application version and restore service quickly.

Database migrations don't work that way. A schema change can:

  • Create inconsistencies that only appear when new data arrives
  • Break queries that run on a schedule, not continuously
  • Cause data corruption that spreads slowly across related tables
  • Affect services you forgot existed or didn't know about

The worst part? Once the damage is done, you can't simply "undo" a schema change the way you revert a code change. A dropped column takes its data with it, and restoring from backup gets harder the longer writes continue. A changed data type requires a reverse migration that carries its own risks.
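
To make that last point concrete: reverting the price column from decimal back to integer is itself destructive. A sketch in PostgreSQL syntax, reusing the hypothetical products table from earlier:

```sql
-- Any fractional cents written while the column was NUMERIC are
-- silently rounded away, and on most engines this rewrite holds an
-- exclusive lock on the table for its full duration.
ALTER TABLE products
    ALTER COLUMN price TYPE INTEGER USING ROUND(price)::INTEGER;
```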

A Practical Checklist Before Your Next Production Migration

Before you run that next migration in production, run through these checks:

  • Identify all consumers. List every service, cron job, report, and data pipeline that touches the affected table. Don't assume you know them all - the database itself can help, as the query sketch after this list shows.
  • Check for deferred execution. Find queries that run on schedules, batch processes, or background jobs. These are the ones that will fail silently hours later.
  • Verify backward compatibility. Can old application code still work with the new schema? For at least one deployment cycle, your schema should support both old and new code.
  • Prepare a recovery plan. Know exactly how you'll restore data if something goes wrong. Test the recovery process, not just the migration.
  • Run the migration during low traffic. Even with all precautions, give yourself a buffer to catch problems before they affect users.
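
For the first item on the list, the database itself can seed the consumer inventory. Two starting points in PostgreSQL, using this article's users table (the second query requires the pg_stat_statements extension to be enabled):

```sql
-- Views that read from the table are recorded in the catalog.
SELECT view_schema, view_name
FROM information_schema.view_table_usage
WHERE table_name = 'users';

-- Recent query traffic: anything touching the table, sorted by
-- how often it runs.
SELECT calls, query
FROM pg_stat_statements
WHERE query ILIKE '%users%'
ORDER BY calls DESC
LIMIT 20;
```

This only surfaces in-database consumers. Cron jobs, reports, and external pipelines still need a human inventory - which is exactly why the checklist exists.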

The Concrete Takeaway

A successful migration is not one that runs without errors. A successful migration is one that doesn't break anything - now, in an hour, or at 2 AM when the nightly report runs. Treat every schema change as a potential time bomb, and verify that all systems, not just the obvious ones, can handle the new structure before you consider the migration done.