When Old Data Meets New Schema: Backfilling and Verifying Legacy Records

You have just added a new column to a database table that has been running in production for three years. The application now writes to both the old and new structures. But what about the millions of rows that were created before this change? They sit there with NULL values in the new column, holding data that is technically incomplete.

This is the moment where many teams realize that schema migration is not just about adding columns. It is about making sure every existing record catches up with the new rules. The process of filling old data into a new structure is called backfill. And if you do it carelessly, you can lock your production table for hours, slow down your application, and frustrate everyone waiting for queries to return.

Why Backfill Is Not a Simple UPDATE

The naive approach is to run one big UPDATE query that sets the new column for every row where it is NULL. If your table has a few thousand rows, this might finish in seconds. But if you are dealing with tens of millions of rows, that single query can hold a table lock for a long time. Other queries queue up. The application slows down. Users start seeing timeouts.

Backfill needs to be planned with the same care you would give to any production operation. The first step is to define exactly what data needs to be filled. Suppose you added a status column whose value is derived from two existing columns: is_active and deleted_at. Your backfill logic must match the same rules that the new application code uses when writing new records. If the application sets status to 'active' when is_active is true and deleted_at is null, your backfill must do the same.
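Expressed as the single-statement UPDATE this section warns about, that rule might look like the following sketch. The table and value names are illustrative, matching the batched example later in this article:

UPDATE your_table
SET status = CASE
    WHEN is_active = true AND deleted_at IS NULL THEN 'active'
    WHEN is_active = false AND deleted_at IS NULL THEN 'inactive'
    ELSE 'deleted'
END
WHERE status IS NULL;

The logic is correct, but one statement means one transaction and one long-held lock over every NULL row. The rest of this article is about doing the same work in smaller pieces.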

Batch Processing: The Safe Way

Instead of one massive query, process the data in small batches. Take one thousand or ten thousand rows that still have NULL values in the new column, update them, then pause briefly before moving to the next batch. This gives the database time to breathe and lets other queries proceed without waiting.

Support for batched updates varies by engine. MySQL accepts LIMIT directly on a single-table UPDATE, so each batch can be one statement. PostgreSQL does not allow LIMIT in an UPDATE, but you can get the same effect with a subquery in the WHERE clause that selects a limited batch of primary keys. Either way, you can also drive the loop from a stored procedure or your migration script. The key is to keep each batch small enough that the lock duration stays acceptable for your workload.
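For instance, a single MySQL batch can be one statement. This sketch assumes the same hypothetical your_table and status columns used in the PostgreSQL example below:

-- MySQL: LIMIT is allowed directly on a single-table UPDATE,
-- so each batch is one self-contained statement.
UPDATE your_table
SET status = CASE
    WHEN is_active = true AND deleted_at IS NULL THEN 'active'
    WHEN is_active = false AND deleted_at IS NULL THEN 'inactive'
    ELSE 'deleted'
END
WHERE status IS NULL
LIMIT 1000;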

Here is a rough pattern that works across many databases:

while there are rows with NULL in new_column:
    select a batch of primary keys where new_column IS NULL
    update those rows with the computed value
    sleep for a short interval

The following flowchart illustrates this safe iterative approach:

flowchart TD
    A[Start] --> B{Any rows with NULL?}
    B -- Yes --> C[Select batch of 1000 PKs where new_column IS NULL]
    C --> D[Compute new column value for batch]
    D --> E["UPDATE rows WHERE id IN (batch)"]
    E --> F[Sleep 100ms]
    F --> B
    B -- No --> G[Done]

For a concrete example, suppose you added a status column derived from is_active and deleted_at. A batched UPDATE in PostgreSQL might look like this:

UPDATE your_table
SET status = CASE
    WHEN is_active = true AND deleted_at IS NULL THEN 'active'
    WHEN is_active = false AND deleted_at IS NULL THEN 'inactive'
    ELSE 'deleted'
END
WHERE id IN (
    SELECT id
    FROM your_table
    WHERE status IS NULL
    LIMIT 1000
);

The inner subquery selects a small batch of primary keys where the new column is still NULL. The outer UPDATE only touches those rows, keeping the lock short. Repeat this in a loop until no NULLs remain.
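That loop can live in a migration script, but PostgreSQL can also run it server-side. Here is a minimal sketch as a DO block, assuming the same hypothetical your_table schema; note that COMMIT inside a DO block requires PostgreSQL 11 or newer, and the block must not be run inside an outer transaction:

DO $$
DECLARE
    batch_count integer;
BEGIN
    LOOP
        -- One batch: same statement as the example above.
        UPDATE your_table
        SET status = CASE
            WHEN is_active = true AND deleted_at IS NULL THEN 'active'
            WHEN is_active = false AND deleted_at IS NULL THEN 'inactive'
            ELSE 'deleted'
        END
        WHERE id IN (
            SELECT id
            FROM your_table
            WHERE status IS NULL
            LIMIT 1000
        );
        GET DIAGNOSTICS batch_count = ROW_COUNT;
        EXIT WHEN batch_count = 0;   -- no NULLs left, backfill done
        COMMIT;                      -- end the batch's transaction so locks are released
        PERFORM pg_sleep(0.1);       -- brief pause so other queries can proceed
    END LOOP;
END $$;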

The sleep interval is not about being slow. It is about giving other operations a chance to run. Even 100 milliseconds between batches can make a big difference in how the database behaves under load.

Two Sources of Truth During Backfill

While backfill is running, the application continues to write new data to both the old and new structures. This means you now have two sources of truth for the same logical data:

  • Data written by the application (consistent by construction, because the application writes to both places at once)
  • Data filled by the backfill process (needs to match the application's logic exactly)

Both must produce the same values for the same input. If your backfill logic has a bug, or if the application logic changed between when the backfill started and when it finished, you will end up with inconsistent data. This is why verification is not optional.

Verification: More Than Checking for NULL

After backfill completes, the natural instinct is to run a quick check: "Are there any NULL values left?" That tells you the column is populated, but it does not tell you whether the values are correct.

Proper verification means comparing the computed value against the expected value for every row. You read the old columns, calculate what the new value should be, and compare it with what is actually stored in the new column. Any mismatch needs investigation.

A mismatch can come from several sources:

  • A bug in the backfill logic that computes the value differently from the application
  • Data that changed between the time backfill processed it and the time verification runs
  • The application writing a different value because its logic was updated mid-migration
  • Race conditions where a row was updated by the application after backfill but before verification

Each mismatch needs to be traced back to its root cause. Sometimes it is a simple fix. Sometimes it reveals a deeper issue in the migration plan.

Progressive Verification

You do not need to verify the entire table in one pass. Start with a random sample. Pick one thousand rows, compare the values, and see if everything matches. If the sample passes, move to a larger sample. If that passes, run a full verification.
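In PostgreSQL, a random-sample check over the hypothetical your_table might look like the sketch below. ORDER BY random() is fine for a one-off thousand-row sample; TABLESAMPLE is cheaper on very large tables:

-- Recompute the expected value for a random sample of rows and
-- report any row where the stored value disagrees.
SELECT id, status AS stored, expected
FROM (
    SELECT id, status,
           CASE
               WHEN is_active = true AND deleted_at IS NULL THEN 'active'
               WHEN is_active = false AND deleted_at IS NULL THEN 'inactive'
               ELSE 'deleted'
           END AS expected
    FROM your_table
    ORDER BY random()
    LIMIT 1000
) sample
WHERE status IS DISTINCT FROM expected;

An empty result means the sample passed and you can widen the net.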

For large tables, full verification does not mean reading every row one by one. You can write a query that counts mismatches without fetching all the data. Something like:

SELECT COUNT(*)
FROM your_table
WHERE status IS DISTINCT FROM CASE
    WHEN is_active = true AND deleted_at IS NULL THEN 'active'
    WHEN is_active = false AND deleted_at IS NULL THEN 'inactive'
    ELSE 'deleted'
END;

Note the IS DISTINCT FROM: a plain != comparison silently skips rows where the new column is still NULL, which are exactly the rows an incomplete backfill leaves behind.

If the count is zero, you are clean. If it is not zero, you know how many rows need attention, and you can investigate them in batches.
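A sketch of pulling the first batch of mismatches for inspection, under the same hypothetical schema:

-- Fetch a batch of disagreeing rows along with the source columns,
-- so each mismatch can be traced back to its inputs.
SELECT id, status, is_active, deleted_at
FROM your_table
WHERE status IS DISTINCT FROM CASE
    WHEN is_active = true AND deleted_at IS NULL THEN 'active'
    WHEN is_active = false AND deleted_at IS NULL THEN 'inactive'
    ELSE 'deleted'
END
ORDER BY id
LIMIT 1000;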

When Verification Passes

Once verification confirms that every row in the new structure matches the expected value, you have confidence that the data is consistent. But the migration is not finished yet. The application is still writing to both structures. The next step is to stop reading from the old schema and rely entirely on the new one. That transition is a separate phase with its own risks and verification steps.

Backfill and verification are the foundation that makes that transition safe. Without them, you are switching to a new schema with incomplete or incorrect data. With them, you know exactly what you have.

Practical Checklist

  • Define the exact logic for computing the new value, matching the application code
  • Choose a batch size that keeps lock duration acceptable for your database
  • Implement batching with a pause between batches
  • Run backfill in a maintenance window or low-traffic period
  • Verify with a random sample before full verification
  • Investigate every mismatch before proceeding
  • Document the backfill logic and verification results for future reference

The Concrete Takeaway

Backfill is not a technical chore you rush through. It is the step where you prove that your migration logic works on real data at scale. If you skip verification, you are trusting that your backfill code is perfect and that no data changed during the process. That trust is rarely justified. Run the backfill in batches, verify the results systematically, and only move to the next phase when you have evidence that every record is correct.