Recovery Drills: Why You Should Practice Failure Before It Hits Production

A few months ago, a team I worked with had a well-documented recovery plan. It lived in their wiki, complete with diagrams, step-by-step procedures, and a list of who to call when things went wrong. Everyone felt prepared. Then one Tuesday evening, a database migration silently corrupted a column that the checkout service depended on. The team pulled up the wiki, followed the rollback steps, and discovered that the script no longer worked. A recent infrastructure change had renamed a few resources, but nobody had updated the recovery plan. What should have been a five-minute rollback turned into a two-hour firefight.

That gap between a written plan and the ability to execute it under pressure is real. And the only way to close it is to practice.

The Problem With Written Plans

Written recovery plans are useful. They force you to think through failure scenarios, document dependencies, and assign responsibilities. But a plan on paper is a hypothesis. It assumes that the steps are still accurate, that the scripts still run, that the people involved still know what to do, and that nothing has changed since the document was last updated.

In practice, software systems change constantly. Dependencies get upgraded, configurations shift, database schemas evolve, and team members come and go. A recovery plan that was perfect six months ago might be completely broken today. The only way to know is to try running it.

This is not about distrusting your team. It is about acknowledging that complex systems have invisible gaps. A rollback script might fail because a new environment variable was added. A verification step might take too long because the monitoring dashboard was reconfigured. A team member might not have the right database credentials because they joined last week. These gaps only surface when you actually execute the plan.
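
One way a drill surfaces these gaps is by forcing even the trivial prerequisites to be checked before anything is rolled back. As a minimal sketch in Python, assuming a rollback script that needs a DATABASE_URL and a DEPLOY_TOKEN (both names are hypothetical stand-ins for whatever your script actually requires):

    import os
    import sys

    # Hypothetical prerequisites for a rollback script; replace with your own.
    REQUIRED_ENV_VARS = ["DATABASE_URL", "DEPLOY_TOKEN"]

    def preflight_check():
        """Return the names of missing prerequisites instead of failing mid-rollback."""
        return [name for name in REQUIRED_ENV_VARS if not os.environ.get(name)]

    if __name__ == "__main__":
        missing = preflight_check()
        if missing:
            print("Rollback would fail: missing " + ", ".join(missing))
            sys.exit(1)
        print("Preflight passed: all required variables are set.")

Running something like this as the first step of a drill turns "a new environment variable was added" from a mid-incident surprise into a one-line fix.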

What Recovery Drills Look Like

Recovery drills are simulated failure scenarios run in a safe environment. The goal is not to induce panic. The goal is to test whether your recovery procedures actually work. There are several common formats.

Game days are the most structured. The team sets aside a full day or half-day to run through multiple failure scenarios. For example, you might deliberately take down a critical service and observe whether the automatic failover kicks in as expected. Or you might simulate a corrupted database and see how long it takes to restore from a backup. Game days are comprehensive, but they require significant preparation and coordination.

Failure drills are shorter and more focused. You pick one specific scenario and run through it in an hour or two. For example, simulate that the latest deployment introduced a bug in the payment flow, and then time how long it takes the team to roll back and verify that the previous version is healthy. Failure drills are easier to schedule and can be done more frequently.
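
To keep the timing honest, the verification half of such a drill can be scripted. The sketch below is one possible approach in Python, assuming a staging health endpoint at a placeholder URL and a five-minute recovery budget; adjust both to your own service and target.

    import time
    import urllib.request

    # Placeholder health endpoint and time budget; swap in your own service and SLO.
    HEALTH_URL = "https://staging.example.com/healthz"
    TIME_BUDGET_SECONDS = 300  # e.g., "previous version healthy within 5 minutes"

    def wait_for_recovery(url, budget):
        """Poll the health endpoint until it returns 200 or the budget runs out."""
        start = time.monotonic()
        while time.monotonic() - start < budget:
            try:
                with urllib.request.urlopen(url, timeout=5) as resp:
                    if resp.status == 200:
                        return time.monotonic() - start
            except OSError:
                pass  # still down or unreachable; keep polling
            time.sleep(5)
        return None

    if __name__ == "__main__":
        elapsed = wait_for_recovery(HEALTH_URL, TIME_BUDGET_SECONDS)
        if elapsed is None:
            print("Drill failed: service did not recover within the budget.")
        else:
            print(f"Service healthy again after {elapsed:.0f} seconds.")

The point is not the script itself but the number it produces: "rollback verified in 7 minutes" is something you can compare from one drill to the next.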

Tabletop exercises are the lightest form. The team gathers around a whiteboard or a shared document and walks through a failure scenario verbally. No actual systems are touched. This is useful for testing decision-making and communication flows, but it does not validate whether the technical steps actually work.

The most valuable drills are the ones that touch real systems in a staging or pre-production environment. That is where the hidden gaps live.

The diagram below captures the core loop of a recovery drill, from simulation through to plan updates.

    flowchart TD
        A[Simulate failure in staging] --> B[Execute recovery plan as written]
        B --> C{Observe outcome}
        C -->|Plan works| D[Document success conditions]
        C -->|Plan fails| E[Identify gaps]
        E --> F[Fix scripts, docs, or access]
        D --> G[Update recovery plan]
        F --> G
        G --> H[Schedule next drill]
        H --> A

Why Failure During Drills Is Valuable

It is tempting to treat recovery drills as pass-fail exercises. If the drill succeeds, the team feels good. If it fails, the team feels embarrassed. But that framing misses the point.

A failure during a drill is a gift. It means you discovered a problem before it hit production. If your rollback script fails during a drill, you have time to fix it. If your verification process takes too long, you can optimize it. If a team member does not know how to trigger the rollback, you can train them. These are problems that would have caused real damage in production, but now they are just learning opportunities.

The opposite is also true. A drill that succeeds might give you false confidence. Maybe the scenario was too simple. Maybe the team was paying extra attention because they knew it was a drill. Maybe the staging environment is configured differently from production. A successful drill is not proof that your recovery plan is solid. It is just evidence that it worked under those specific conditions.

This is why drills should be varied. Rotate scenarios. Change the timing. Introduce unexpected complications. The more realistic the drill, the more useful the feedback.
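
Rotation does not need tooling, but even a few lines can keep the next drill from becoming a rehearsed routine. The scenario and complication lists below are purely illustrative; fill them with the failures that actually worry your team.

    import random

    # Illustrative lists; replace with the scenarios that keep your team up at night.
    SCENARIOS = [
        "bad deployment in the checkout service",
        "database migration corrupts a column",
        "expired TLS certificate on the API gateway",
        "primary database failover",
    ]
    COMPLICATIONS = [
        "the usual on-call engineer is unavailable",
        "the monitoring dashboard is down",
        "the rollback must happen during simulated peak traffic",
        "no complication",
    ]

    if __name__ == "__main__":
        # Pick the next drill at random so nobody can rehearse a single scenario.
        print("Next drill:", random.choice(SCENARIOS))
        print("Complication:", random.choice(COMPLICATIONS))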

What Drills Reveal Beyond the Technical Steps

Recovery drills expose things that no document can capture. They reveal who actually has access to run a rollback in production. They show whether the on-call engineer receives the right alerts quickly enough. They test whether communication between teams stays clear when pressure rises.

I have seen drills where the rollback script worked perfectly, but the team spent twenty minutes trying to figure out who had the authority to approve it. I have seen drills where the technical steps were correct, but the monitoring dashboard did not show the right metrics to confirm that the system had recovered. These are organizational and process problems that only surface when you run the drill.

How Often Should You Drill?

Frequency depends on how often you deploy and how risky your changes are. Teams that deploy multiple times a day should drill at least once a month. Teams that deploy weekly or biweekly can drill every quarter. The key is consistency. A single drill that nobody repeats is almost useless. The second drill is where you verify that the fixes from the first drill actually work. The third drill is where the process becomes muscle memory.

After each drill, document what you learned and update your recovery plan. Add missing steps. Fix broken scripts. Grant access to the right people. A recovery plan that is updated after every drill becomes a living document that reflects reality, not a static artifact that collects dust.

A Quick Checklist for Your First Drill

If you have never run a recovery drill, start small. Pick one scenario that keeps you up at night. It might be a failed deployment, a corrupted database, or a misconfigured service. Then run through this checklist:

  • Choose a safe environment (staging or pre-production, not production).
  • Define what success looks like (e.g., system is healthy within 10 minutes).
  • Assign roles: who runs the drill, who observes, who takes notes.
  • Run the scenario and follow your recovery plan exactly as written.
  • Time every step (a simple log like the sketch after this list works well). Note where the plan was unclear or incomplete.
  • After the drill, discuss what went wrong and what needs to change.
  • Update the recovery plan and schedule the next drill.
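
For the timing step, a minimal drill log turns the retro into numbers rather than impressions. The Python sketch below is one way to do it, reusing the 10-minute success criterion from the checklist; the scenario name and step notes are placeholders.

    from dataclasses import dataclass, field
    from datetime import datetime, timezone

    SUCCESS_BUDGET_MINUTES = 10  # mirrors the example success criterion above

    @dataclass
    class DrillLog:
        scenario: str
        steps: list = field(default_factory=list)

        def record(self, note):
            """Attach a UTC timestamp to every step as it happens."""
            self.steps.append((datetime.now(timezone.utc), note))

        def summary(self):
            """List each step and whether the drill finished within budget."""
            elapsed = (self.steps[-1][0] - self.steps[0][0]).total_seconds() / 60
            verdict = "within budget" if elapsed <= SUCCESS_BUDGET_MINUTES else "over budget"
            lines = [f"{ts:%H:%M:%S}  {note}" for ts, note in self.steps]
            lines.append(f"Total: {elapsed:.1f} min ({verdict})")
            return "\n".join(lines)

    if __name__ == "__main__":
        log = DrillLog("rollback after a bad checkout deployment")
        log.record("Failure injected in staging")
        log.record("Rollback script executed")
        log.record("Health checks green, drill complete")
        print(log.summary())

The summary is short enough to paste straight into the post-drill notes, which makes the "update the recovery plan" step harder to skip.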

Do not aim for perfection on the first try. Aim for learning.

The Bottom Line

A recovery plan that has never been tested is just a wish. The only way to know whether your team can actually recover from failure is to simulate failure in a safe environment, observe what happens, and fix what breaks. Recovery drills are not about proving that your plan works. They are about finding the gaps before production finds them for you.