Backfilling Legacy Data Without Breaking Your Production Database
You just deployed a new migration that adds a last_login_at column to the users table. The schema change went smoothly. But now you look at the data: every existing user has a null value in that column. Their login history from last week, last month, last year -- all of it is invisible to the new field.
This is the moment you need a backfill.
What Backfill Actually Means
Backfill is the process of filling in data that existed before a migration was applied. It's not about moving data to a new structure -- that's what migration scripts handle. Backfill is about bringing old data up to date with the new rules your system now follows.
The situations where backfill becomes necessary are common:
- You added a new column, but existing rows don't have values for it.
- You changed how addresses are stored, from a single text field to separate street, city, and postal code columns.
- You introduced a new calculation, like a risk score for transactions, but past transactions were never scored.
In every case, the data is sitting there, valid but incomplete. The system knows what to do with new data coming in, but the old data is stuck in the previous format.
Why You Cannot Process Everything at Once
The naive approach is to run a single query that updates all rows at once. For a small table with a few hundred rows, that works fine. For a table with millions of rows, it is a disaster waiting to happen.
A single massive update locks rows, consumes transaction logs, and slows down every other query hitting the same table. If your application is serving users during the backfill, those users will experience timeouts, slow responses, or failed requests. The database might even run out of memory or disk space trying to handle the operation.
The solution is to process data in small, controlled chunks.
Batch Processing: The Core Technique
Instead of updating one million rows in one shot, you update ten thousand rows at a time, pause, and then process the next batch. This is called batch processing, and it is the foundation of safe backfills.
Here is how it works in practice, using the last_login_at example:
-- Process one batch of rows that still need backfilling
-- (MySQL syntax; PostgreSQL does not allow LIMIT on UPDATE --
-- there, select a batch of ids first and update by id)
UPDATE users
SET last_login_at = (
    SELECT MAX(login_time)
    FROM login_history
    WHERE login_history.user_id = users.id
)
WHERE last_login_at IS NULL
  -- skip users with no logins, or they would match again forever
  AND EXISTS (
    SELECT 1 FROM login_history
    WHERE login_history.user_id = users.id
  )
LIMIT 10000;
After this runs, you check how many rows were affected. If the count matches the batch size, you wait a few seconds and run it again. If it is smaller, that was the last batch and the backfill is complete.
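The loop around that query -- run a batch, count affected rows, pause, repeat -- can be sketched in Python. This is a minimal sketch against an in-memory SQLite database standing in for production (SQLite lacks UPDATE ... LIMIT, so it selects a batch of ids first and updates by id); table and column names follow the example above.

```python
import sqlite3
import time

BATCH_SIZE = 3          # tiny for the demo; thousands in production
THROTTLE_SECONDS = 0.0  # pause between batches; raise this in production

def run_backfill(conn):
    """Backfill users.last_login_at in batches until no rows remain."""
    total = 0
    while True:
        # One batch of ids that still need backfilling and actually
        # have login history to copy from (so the loop terminates).
        ids = [row[0] for row in conn.execute(
            """SELECT u.id FROM users u
               WHERE u.last_login_at IS NULL
                 AND EXISTS (SELECT 1 FROM login_history h
                             WHERE h.user_id = u.id)
               LIMIT ?""", (BATCH_SIZE,))]
        if not ids:
            break  # nothing left: backfill complete
        placeholders = ",".join("?" * len(ids))
        conn.execute(
            f"""UPDATE users
                SET last_login_at = (SELECT MAX(login_time)
                                     FROM login_history
                                     WHERE login_history.user_id = users.id)
                WHERE id IN ({placeholders})""", ids)
        conn.commit()                 # one transaction per batch
        total += len(ids)
        time.sleep(THROTTLE_SECONDS)  # throttle: let the database breathe
    return total

# Demo data: seven users, but only the first five ever logged in.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, last_login_at TEXT)")
conn.execute("CREATE TABLE login_history (user_id INTEGER, login_time TEXT)")
conn.executemany("INSERT INTO users (id) VALUES (?)", [(i,) for i in range(1, 8)])
conn.executemany("INSERT INTO login_history VALUES (?, ?)",
                 [(i, f"2025-03-{10 + i:02d}") for i in range(1, 6)])
backfilled = run_backfill(conn)
print(backfilled)  # 5 -- users 6 and 7 keep their null, by design
```

Because the batch query only selects rows that still need work, running the script a second time processes zero rows, which is exactly the idempotency property discussed below.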
Choosing the Right Batch Size
There is no universal batch size that works for every database. The right size depends on:
- How powerful your database server is.
- How much load the application is placing on the database.
- How complex the update logic is.
- How much transaction log space is available.
Start with a conservative size, like 5,000 rows. Run a few batches and watch the database metrics: CPU usage, disk I/O, query latency from the application side. If the database handles it easily, double the batch size. If you see spikes in latency or lock contention, cut the size in half.
The goal is to find a batch size that completes in a few seconds without causing noticeable impact on other queries. A batch that takes thirty seconds is probably too large for a production system under normal load.
Throttling: Giving the Database Breathing Room
Batch size controls how much work happens in one unit. Throttling controls how much time passes between units.
After each batch completes, add a deliberate pause before starting the next one. This pause lets the database flush pending writes, release locks, and serve other queries without competition from your backfill.
A typical throttle might be two to five seconds between batches. During peak hours, you might increase that to ten or fifteen seconds. During maintenance windows, you might reduce it to one second or remove it entirely.
The throttle is your safety valve. If something goes wrong -- a sudden spike in application traffic, a slow query from another team, a replication lag warning -- you can increase the pause and let the system stabilize before continuing.
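For the throttle to work as a safety valve, operators need to change it mid-run without redeploying. One sketch of that: re-read the pause from an environment variable before every batch. The variable name BACKFILL_THROTTLE_SECONDS is an assumption for illustration, not a standard.

```python
import os

def throttle_seconds(default=2.0):
    """Read the pause between batches from the environment each time,
    so an operator can tune it mid-run without a code change.
    (Variable name BACKFILL_THROTTLE_SECONDS is illustrative.)"""
    raw = os.environ.get("BACKFILL_THROTTLE_SECONDS")
    try:
        return max(0.0, float(raw))
    except (TypeError, ValueError):
        return default  # unset or malformed: fall back safely

os.environ.pop("BACKFILL_THROTTLE_SECONDS", None)
print(throttle_seconds())  # 2.0 -- the default when nothing is set
os.environ["BACKFILL_THROTTLE_SECONDS"] = "10"
print(throttle_seconds())  # 10.0 -- picked up on the very next batch
```

A shared config table or feature flag works the same way; the point is that the loop re-reads the value every iteration instead of caching it at startup.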
Making Backfills Idempotent
A backfill script must be safe to run multiple times. If a batch fails halfway through, or if you need to restart the entire process, running the same script again should not produce duplicate data or errors.
Idempotency for backfills usually means one of two things:
- Check before write: Only update rows that still have null values or old values.
- Use upsert logic: Insert or update based on whether the row already has the new data.
For the last_login_at example, the query above is already idempotent because it only targets rows where the column is still null. If a batch fails after updating 5,000 rows, the next run will skip those rows and continue with the remaining ones.
For more complex backfills, like recalculating a derived value, you might add a processed_at timestamp column. The backfill script checks whether processed_at is null before processing each row. Once processed, the timestamp is set, and subsequent runs skip that row.
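The processed_at pattern can be sketched as follows, again against an in-memory SQLite database. The transactions table, its columns, and the scoring rule are all hypothetical placeholders for whatever derived value you are recalculating.

```python
import sqlite3

def backfill_scores(conn, batch_size=100):
    """Score only rows not yet marked processed, so a restart after a
    crash skips completed rows instead of redoing or duplicating work."""
    done = 0
    while True:
        rows = conn.execute(
            """SELECT id, amount FROM transactions
               WHERE processed_at IS NULL LIMIT ?""",
            (batch_size,)).fetchall()
        if not rows:
            return done
        for tx_id, amount in rows:
            score = min(amount / 1000.0, 1.0)  # placeholder scoring rule
            conn.execute(
                """UPDATE transactions
                   SET risk_score = ?, processed_at = datetime('now')
                   WHERE id = ? AND processed_at IS NULL""",
                (score, tx_id))
        conn.commit()
        done += len(rows)

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE transactions
                (id INTEGER PRIMARY KEY, amount REAL,
                 risk_score REAL, processed_at TEXT)""")
conn.executemany("INSERT INTO transactions (id, amount) VALUES (?, ?)",
                 [(1, 250.0), (2, 4000.0), (3, 990.0)])
print(backfill_scores(conn))  # 3 -- all rows scored on the first run
print(backfill_scores(conn))  # 0 -- the second run is a no-op
```

Setting the timestamp in the same statement as the value keeps the "done" marker and the data atomic, so a row can never end up scored but unmarked or marked but unscored.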
Logging: The Detail Nobody Thinks About Until It Breaks
When a backfill runs for hours, you need to know where it is and whether it is still working correctly. Log every batch:
- Batch number and time range.
- Number of rows processed.
- Duration of the batch.
- Any errors encountered.
- Current progress as a percentage or row count.
This log serves two purposes. First, if the backfill stops unexpectedly, you can resume from the last completed batch instead of starting over. Second, when the backfill finishes, you have a record of exactly what happened, which helps with debugging and auditing.
A simple log entry might look like this:
2025-03-15 14:32:01 | Batch 47 | Processed 10,000 rows | Duration 3.2s | No errors
2025-03-15 14:32:06 | Batch 48 | Processed 10,000 rows | Duration 3.1s | No errors
2025-03-15 14:32:11 | Batch 49 | Processed 10,000 rows | Duration 3.5s | No errors
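Entries in that shape are cheap to produce with the standard library. A minimal sketch, with the logger name and line format chosen to match the sample above:

```python
import logging

logging.basicConfig(format="%(asctime)s | %(message)s",
                    datefmt="%Y-%m-%d %H:%M:%S", level=logging.INFO)
log = logging.getLogger("backfill")

def format_batch_line(batch_no, rows, duration_s, errors=0):
    """Build one progress line per batch in the format shown above."""
    status = "No errors" if errors == 0 else f"{errors} errors"
    return (f"Batch {batch_no} | Processed {rows:,} rows"
            f" | Duration {duration_s:.1f}s | {status}")

# Call once per completed batch inside the backfill loop:
log.info(format_batch_line(47, 10_000, 3.2))
```

Writing these lines to a file (or your log aggregator) rather than only to the console is what makes resuming possible: the last logged batch number tells you exactly where to pick up.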
A Practical Backfill Checklist
Before you run a backfill in production, go through this list:
- Batch size is tested on a staging environment with similar data volume.
- Throttle interval is configured and adjustable without code changes.
- Script is idempotent -- running it twice produces the same result.
- Logging captures batch progress, errors, and timing.
- Rollback plan exists: you can reverse the backfill if something goes wrong.
- Monitoring is in place to detect database performance degradation.
- A dry run has been executed on a copy of production data.
The Takeaway
Backfilling is not a one-time script you write and forget. It is a controlled operation that respects the fact that your database is serving users while you are changing their data. Batch processing and throttling are not optimizations -- they are the minimum requirements for doing this work safely. Without them, you are one large query away from a production incident.