When Infrastructure Changes Break: A Step-by-Step Recovery Walkthrough
The pipeline turned red. A Terraform apply that should have taken two minutes has been running for fifteen. Your monitoring dashboard shows five resources that failed to create, and the load balancer health checks are returning 503s. The chat channel is quiet for now, but you know that silence won't last.
This is the moment every infrastructure engineer dreads. Not the failure itself, but the uncertainty that follows: What do you do first? Do you roll back immediately? Do you try to fix it in place? How do you know when things are actually back to normal?
The difference between a controlled recovery and a chaotic scramble comes down to having a clear sequence of actions before you need them. Here is a practical walkthrough of how to run a recovery when infrastructure changes go wrong.
Step 1: Confirm the Failure
The first sign of trouble should not come from a user complaint. It should come from your pipeline and monitoring systems. A well-designed CI/CD pipeline for infrastructure changes includes checkpoints that verify each step: Did the resource create successfully? Is the configuration correct? Is the service still responding properly?
Here is what a typical failure confirmation looks like in practice:
# Re-run the plan to check the current state of the configuration
terraform plan -var-file=production.tfvars
# Example error from the failed apply, showing a clear failure
# Error: Error creating security group: InvalidGroup.Duplicate: The security group 'web-sg' already exists
# on main.tf line 42, in resource "aws_security_group" "web":
# 42: resource "aws_security_group" "web" {
# Check the pipeline logs for the failed job
curl -s https://pipeline.internal/api/v1/jobs/12345/logs | tail -50
# Example log snippet
# [ERROR] Terraform apply failed: Error creating security group: InvalidGroup.Duplicate
# [INFO] Retry attempt 1/3...
# [ERROR] Terraform apply failed: Error creating security group: InvalidGroup.Duplicate
When a checkpoint fails, the pipeline stops and signals that something is wrong. But before you jump into recovery mode, take a moment to confirm the failure is real. Monitoring alerts can fire for many reasons: a temporary network glitch, a timeout that resolves on retry, or a false positive from a misconfigured health check.
What to check:
- Look at the pipeline logs. Is the error consistent or intermittent?
- Check if the same operation succeeds when retried manually.
- Verify that the monitoring alert is not a known false positive.
If the failure is confirmed, you move to the next step. If it was a transient issue, document it and move on. No need to trigger a full recovery for a hiccup.
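Running those checks by hand might look something like this; the health check URL is a stand-in for whatever endpoint your own service exposes:
# Re-run the failing operation manually to see whether the error is consistent
terraform plan -var-file=production.tfvars
# Query the health endpoint directly instead of trusting the dashboard
curl -s -o /dev/null -w "%{http_code}\n" https://app.internal/healthz
# A 200 here while the alert is still firing points at the alert, not the infrastructure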
Step 2: Decide the Recovery Strategy
This is where having a pre-written recovery plan pays off. In the heat of the moment, you do not want to debate whether to roll back, reapply the last known good configuration, or fail over to a backup environment. Those decisions should already be documented and agreed upon by the team.
The key factor in this decision is time. Most recovery plans define a rollback window: a period after the change during which a full rollback is safe. If the failure is detected within minutes, rolling back to the previous state is usually the best option. The infrastructure has not had time to drift, and dependent resources are unlikely to have adapted to the new configuration.
But if an hour has passed and the change has already propagated to other resources, a simple rollback might cause more problems than it solves. Other systems may have started depending on the new configuration. In that case, the better strategy might be to reapply the last known good configuration, or to fail over to a standby environment that was never touched by the failed change.
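As a rough sketch, the elapsed time since the change can be compared against the agreed window before anyone debates strategy. Reading the timestamp from the Git history is an assumption here; yours might come from the pipeline instead:
# Sketch: compare time since the last infrastructure change against a 30-minute window
ROLLBACK_WINDOW_SECONDS=1800                      # the window agreed in the recovery plan
LAST_CHANGE=$(git log -1 --format=%ct -- main.tf) # timestamp of the last change to the config
NOW=$(date +%s)
if [ $((NOW - LAST_CHANGE)) -lt "$ROLLBACK_WINDOW_SECONDS" ]; then
  echo "Within the rollback window: full rollback is the default choice"
else
  echo "Outside the window: consider reapplying the last good config or failing over"
fi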
Three common recovery strategies:
- Full rollback: Revert to the exact previous state using your Infrastructure as Code tool.
- State reapplication: Apply the last known good configuration without reverting other changes.
- Failover: Route traffic to a separate environment that was not affected by the change.
The decision should be guided by your recovery plan, not by what feels right in the moment.
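As a rough illustration, the three strategies might translate into commands like these. The tag, resource address, and hosted zone ID are placeholders, not recommendations; your recovery plan should name the real ones:
# Full rollback: check out the last known good configuration and apply it
git checkout v1.4.2 -- infra/
terraform apply -var-file=production.tfvars

# State reapplication: re-apply only the piece the failed change touched
terraform apply -target=aws_security_group.web -var-file=production.tfvars

# Failover: point DNS at the standby environment that was never changed
aws route53 change-resource-record-sets \
  --hosted-zone-id Z0123456789ABC \
  --change-batch file://failover-to-standby.json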
Step 3: Execute the Recovery
Once the strategy is chosen, execute it exactly as documented. This is not the time for improvisation. If your plan says to run terraform apply with a specific state file, run that command. Do not try a different flag or a newer version of the tool because you think it might be faster.
During execution, log every action you take. Note the time, the command, the output, and any unexpected behavior. These logs are not just for post-mortems. They help you track what has been done in case the recovery itself causes new issues.
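One lightweight way to keep that record is to timestamp each step and tee everything into a single recovery log; the file name here is just a suggestion:
# Capture each recovery command and its full output in a timestamped log
RECOVERY_LOG="recovery-$(date +%Y%m%d-%H%M).log"
{
  echo "=== $(date -u +%FT%TZ) terraform apply (rollback) ==="
  terraform apply -var-file=production.tfvars
} 2>&1 | tee -a "$RECOVERY_LOG"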
If the strategy involves failover, activate the mechanism you prepared earlier. This might mean updating DNS records, switching load balancer targets, or changing routing configuration. The exact steps depend on your infrastructure, but the principle is the same: follow the plan, not your intuition.
Step 4: Verify the Recovery
Recovery is not complete until you have verified that everything is back to normal. Do not assume that because the Terraform apply succeeded, the infrastructure is healthy. Do not assume that because the server is online, the application is working.
Verification means checking multiple layers:
- Resource state: Are the infrastructure resources in the expected configuration?
- Service health: Are the services running and responding to requests?
- Application behavior: Can the application perform its core functions?
- Dependent systems: Are downstream services that rely on this infrastructure also healthy?
Run the same checks that your pipeline would run during a normal deployment. If you have automated smoke tests, run them. If you have manual verification steps, follow them. The goal is to be certain, not hopeful.
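A minimal verification pass could look like the following; the endpoints are stand-ins for whatever your own smoke tests cover:
# Resource state: the plan should report no unexpected changes after recovery
terraform plan -var-file=production.tfvars -detailed-exitcode
# Exit code 0 means no changes; 2 means there are pending changes worth investigating

# Service health: the load balancer should be answering with 200s again
curl -s -o /dev/null -w "%{http_code}\n" https://app.internal/healthz

# Application behavior: exercise one real user-facing path, not just the health check
curl -s -o /dev/null -w "%{http_code}\n" https://app.internal/api/v1/orders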
Step 5: Communicate the Outcome
Once verification is complete, tell the rest of the team. Other engineers may be waiting for your infrastructure to stabilize before they deploy their own changes. Operations teams may be monitoring the same alerts and wondering if they need to escalate.
A clear communication should include:
- What went wrong
- What was done to recover
- Whether the recovery was successful
- Any ongoing risks or observations
This prevents overlapping changes and reduces confusion. It also helps other teams adjust their plans if needed.
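A short status update covering those four points is usually enough. For example (the details here are illustrative): "The security group change in pipeline job 12345 failed to apply because of a duplicate group name. We rolled back to the previous configuration inside the rollback window, and health checks have been green for the last twenty minutes. No customer impact observed. Hold any dependent deploys until we rename the group and re-run the change."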
A Practical Recovery Checklist
When the pressure is on, a simple checklist helps you stay focused. Here is one you can adapt for your team:
- Confirm the failure is real (not a transient issue or false alarm)
- Check the rollback window: how long has it been since the change?
- Select the recovery strategy from the pre-written plan
- Execute the recovery steps exactly as documented
- Log every action and its result
- Verify infrastructure state, service health, and application behavior
- Communicate the outcome to the team
The Real Work Starts After Recovery
The infrastructure is back to normal. The alerts have cleared. The team can breathe again. But the recovery process is not truly finished until you have answered one question: Why did this happen, and what can we do to prevent it from happening again?
The evaluation after recovery is where you improve your processes. Maybe the pipeline needs better pre-deployment checks. Maybe the recovery plan missed a step. Maybe the team needs a clearer rollback window policy. Whatever the lesson, capture it and update your plans accordingly.
A failed infrastructure change is not a failure of the team. It is a signal that the system needs improvement. The teams that recover well are not the ones that never fail. They are the ones that have a clear, practiced, and repeatable process for when things go wrong.