Recovery Plans for High-Risk Infrastructure Changes
You have a change coming up that could break production. Maybe it's a network architecture overhaul, a database migration, or a security group reconfiguration that touches critical services. The team has reviewed the plan, assessed the blast radius, and everyone agrees the change carries real risk.
Now what?
Before anyone runs terraform apply or clicks that deploy button, you need a recovery plan. Not a thick document that sits on a shelf. A practical, specific, executable plan for what to do if things go wrong.
Start With One Question
The entire recovery plan begins with a single question: If this change fails, what do we need to do to get everything back to a safe state?
The answer depends on what you're changing and how far the blast radius extends. A simple configuration change to one security group might have a short answer: reapply the old configuration. A network architecture change or database migration demands a much more detailed response.
Don't overcomplicate this. The recovery plan should be proportional to the risk. A five-line plan is fine for a low-risk change. A multi-step runbook is appropriate for something that could take down a core service.
Three Things Every Recovery Plan Needs
A useful recovery plan has three components, and they all need to be explicit.
Concrete recovery steps. "Roll back to the previous version" is not a step. Write down the exact commands, the servers they run on, the parameters to use, and how to verify the recovery actually worked. If someone needs to SSH into a bastion host and run a specific Terraform command with a specific state file, say that. If they need to restore a database from a snapshot, include the snapshot name and the restore procedure.
Here is a concrete example of what those recovery steps look like for a security group change on AWS:
# Recovery Plan: Revert Security Group Change
# Target: sg-12345678 (production web tier)

# Step 1: Revert security group rules
aws ec2 revoke-security-group-ingress \
  --group-id sg-12345678 \
  --protocol tcp --port 443 --cidr 0.0.0.0/0
aws ec2 authorize-security-group-ingress \
  --group-id sg-12345678 \
  --protocol tcp --port 443 --cidr 10.0.0.0/16

# Step 2: Verify the rules are correct
aws ec2 describe-security-groups \
  --group-ids sg-12345678 \
  --query 'SecurityGroups[0].IpPermissions'

# Step 3: Confirm service is reachable
curl -s -o /dev/null -w "%{http_code}" https://api.example.com/health
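If the change was applied through Terraform instead, the same idea holds: write down the exact revert sequence before you need it. The sketch below is illustrative, not prescriptive. It assumes the pre-change configuration is tagged in version control and the local state file was copied aside before the apply; the tag name and file paths are placeholders.

# Recovery Plan: Revert a Terraform-managed change
# Assumptions (placeholders): pre-change config is tagged "pre-change",
# and the local state file was copied to terraform.tfstate.backup
# before the apply.

# Step 1: Check out the known-good configuration
git checkout pre-change

# Step 2: Restore the pre-change state file
cp terraform.tfstate.backup terraform.tfstate

# Step 3: Review the plan -- it should contain only the revert
terraform init
terraform plan -out=rollback.tfplan

# Step 4: Apply the revert
terraform apply rollback.tfplan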
Who decides to activate recovery. When things go wrong, people panic. They start making decisions on their own. Some will want to roll back immediately. Others will want to dig into the problem first. You need a single person or role with the authority to say "stop, we're recovering now." This is not a democratic process during an incident. Name the person or role explicitly in the plan.
When to activate the plan. Not every failure triggers an immediate recovery. Sometimes it's better to let the change run while the team investigates. But you need clear boundaries. Two common approaches work well:
- Time-based: If the system isn't stable within 15 minutes after applying the change, start recovery.
- Impact-based: If error rates go above 5 percent, or if users report they can't access the service, start recovery.
Write these thresholds down. Don't rely on people remembering them during an incident.
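Thresholds like these can also be encoded as a small watch script, so nobody has to eyeball a dashboard under pressure. Here is a minimal sketch, assuming the same placeholder health endpoint as above and reusing the 15-minute and 5 percent figures; substitute your own endpoint and limits.

#!/usr/bin/env bash
# Post-apply watch: poll the health endpoint for the time window.
# Exits non-zero if the failure rate crosses the limit -- the
# signal to activate the recovery plan. Endpoint and thresholds
# are placeholders.

ENDPOINT="https://api.example.com/health"
WINDOW_SECONDS=900   # time-based threshold: 15 minutes
INTERVAL=10
MAX_FAILURE_RATE=5   # impact-based threshold, in percent

checks=0
failures=0
end=$(( $(date +%s) + WINDOW_SECONDS ))

while [ "$(date +%s)" -lt "$end" ]; do
  code=$(curl -s -o /dev/null -w "%{http_code}" "$ENDPOINT")
  checks=$((checks + 1))
  [ "$code" = "200" ] || failures=$((failures + 1))
  rate=$(( failures * 100 / checks ))
  # Ignore the first few samples so one transient blip doesn't trip it
  if [ "$checks" -ge 5 ] && [ "$rate" -gt "$MAX_FAILURE_RATE" ]; then
    echo "Failure rate ${rate}% exceeds ${MAX_FAILURE_RATE}% -- activate recovery"
    exit 1
  fi
  sleep "$INTERVAL"
done

echo "System stable for ${WINDOW_SECONDS}s -- change holds"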
The Pre-Apply Checklist
Before you apply any high-risk change, there are things that must already be done. This is your pre-apply checklist, and it belongs in the recovery plan.
Common items include:
- Latest snapshot taken
- Infrastructure state file backed up
- Access to recovery systems confirmed
- Team members know their roles during recovery
- Communication channel established for incident coordination
This checklist exists because you cannot prepare for recovery during an emergency. If you didn't take a snapshot before the change, you cannot restore from that snapshot afterward. If you didn't back up the state file, you cannot roll back with confidence.
Taken together, the full sequence runs from change review through pre-apply checks, apply, and monitoring, with recovery execution as the final step if the change goes wrong.
The pre-apply checklist also serves as a forcing function. It makes the team stop and confirm that recovery is actually possible before making the change. If you can't check off every item, don't apply the change yet.
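Part of the checklist can even be enforced in code, which makes it much harder to skip in the rush to apply. The following sketch covers the snapshot and state-file items; the RDS instance identifier, the S3 bucket, and the use of a local state file are all placeholder assumptions to adapt.

#!/usr/bin/env bash
# Pre-apply checks: abort unless recovery prerequisites exist.
# Identifiers below are placeholders.
set -euo pipefail

DB_INSTANCE="prod-db"                     # placeholder
STATE_BUCKET="s3://example-tf-backups"    # placeholder

# Check 1: the most recent snapshot was taken today
latest=$(aws rds describe-db-snapshots \
  --db-instance-identifier "$DB_INSTANCE" \
  --query 'max_by(DBSnapshots, &SnapshotCreateTime).SnapshotCreateTime' \
  --output text)
echo "Latest snapshot: $latest"
[[ "$latest" == "$(date +%Y-%m-%d)"* ]] || {
  echo "No snapshot from today -- take one before applying"
  exit 1
}

# Check 2: back up the current state file (assumes local state)
aws s3 cp terraform.tfstate \
  "$STATE_BUCKET/pre-change-$(date +%Y%m%dT%H%M%S).tfstate"

echo "Pre-apply checks passed"

Because the script exits non-zero when a prerequisite is missing, it can sit in front of the apply step in a deploy wrapper or CI job, turning "don't apply the change yet" from a team norm into a hard gate.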
Who Approves the Recovery Plan
For high-risk infrastructure changes, the recovery plan needs approval from someone who understands the impact and has the authority to accept the risk. This could be a lead engineer, an engineering manager, or a representative from a team that will be affected by the change.
The title doesn't matter. What matters is that the approver can evaluate whether the recovery plan is sufficient or if there are gaps. They should be able to ask questions like "What happens if the rollback also fails?" or "How do we verify the system is healthy after recovery?"
This is not a rubber stamp. If the approver doesn't understand the plan well enough to assess it, they shouldn't approve it.
Store the Plan Where People Can Find It
A recovery plan is useless if nobody can find it during an incident. Don't store it on your laptop. Don't put it in a folder that requires special access. Don't bury it in a long email thread.
Put the plan in a shared location that everyone involved can access quickly. Team wiki, shared drive, or even as a file attached to the pull request that contains the infrastructure change. The goal is to remove any friction between "something went wrong" and "we have the plan in front of us."
A Quick Practical Checklist
Before applying a high-risk infrastructure change, run through this:
- Recovery steps are written out with exact commands and parameters
- A specific person is designated to decide when to activate recovery
- Time or impact thresholds for activating recovery are defined
- Latest snapshot or backup is confirmed available
- Infrastructure state file is backed up
- All team members know their roles during recovery
- The plan is stored in a location accessible to everyone involved
- Someone with authority has reviewed and approved the plan
The Real Test Comes During Execution
A recovery plan that has been prepared and approved is not the same as a recovery plan that works. The only way to know if your plan is solid is to test it. That means running through the recovery steps in a safe environment, verifying that the commands produce the expected results, and confirming that the team can execute the plan under pressure.
But that is a topic for another discussion. For now, the important thing is to have the plan ready before you apply the change. A good recovery plan doesn't guarantee that everything will go smoothly, but it does mean you won't be making decisions in the dark when something breaks.