When Auto-Recovering Infrastructure Makes Things Worse

It's 2 AM. Your production application starts throwing connection pool errors. The on-call engineer jumps into the cloud console, tweaks a database parameter, and the system stabilizes. Everyone breathes again.

Twenty minutes later, the alerts come back. Same errors. The engineer checks the console and finds the parameter has been reset to its original value. Your infrastructure-as-code pipeline detected the manual change as "drift" and automatically reverted it. Now you're back in an incident, and the cycle will keep repeating until someone disables the auto-reconciliation mechanism.

This scenario isn't hypothetical. It's what happens when automated reconciliation treats every deviation from code as a problem to fix.

The Problem With Assuming All Drift Is Bad

Automated reconciliation sounds perfect on paper. Your pipeline detects when the real-world infrastructure has drifted from what's defined in code, then automatically reapplies the desired state. No human intervention needed. No window for drift to persist.

But the core assumption is wrong: not all drift is a mistake. Sometimes changes outside the pipeline happen for legitimate reasons, and those changes are what keep the system running.

During an incident, engineers make emergency changes. During a database migration, teams temporarily modify resources as part of a careful procedure. During load testing, scaling parameters get adjusted on the fly. These are all valid reasons to have infrastructure that doesn't match your code repository.

An automated reconciliation system has no way to distinguish between a destructive change and a lifesaving one. It only knows that actual state differs from desired state, and its job is to restore the desired state. It has no context about why the change was made, whether there's an active incident, or whether the change has been validated by the team.
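
In rough Python terms, the naive loop looks like this (a sketch with stubbed-out functions; the poll interval is an arbitrary example):

import time

POLL_INTERVAL_SECONDS = 300  # hypothetical: check every five minutes

def detect_drift() -> bool:
    """Stub: a real implementation compares live state against the code."""
    return False

def apply_desired_state() -> None:
    """Stub: a real implementation runs something like `terraform apply`."""

def naive_reconcile_loop() -> None:
    # The failure mode in miniature: every deviation is auto-reverted,
    # with no notion of incidents, migrations, or who made the change.
    while True:
        if detect_drift():
            apply_desired_state()  # no review, no context, no exceptions
        time.sleep(POLL_INTERVAL_SECONDS)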

When Timing Makes Things Worse

The timing of automated reconciliation creates another category of risk. Consider a team in the middle of a complex database migration. They're deliberately modifying several cloud resources manually as part of a multi-step procedure that hasn't been fully captured in IaC yet. If the reconciliation pipeline runs mid-migration, it could delete resources that are in transition, causing data loss or complete migration failure.

This isn't a theoretical edge case. Migrations, infrastructure upgrades, and security incident responses all involve temporary states where the live environment intentionally differs from the codebase. Automated reconciliation that runs during these windows doesn't just cause inconvenience. It can corrupt data, break running services, or undo critical security measures that were applied manually because the situation demanded speed over process.

Controlling the Risk Without Abandoning Automation

The answer isn't to disable automated reconciliation entirely. The answer is to build controls that match how real operations work.

The following flowchart illustrates the problematic auto-reconciliation loop and where the proposed controls intervene.

flowchart TD
    A[Drift Detected] --> B{Approval Gate?}
    B -- No --> C[Auto-Revert]
    C --> D[Incident Repeats]
    D --> A
    B -- Yes --> E[Human Review]
    E --> F{Legitimate Drift?}
    F -- Yes --> G[Adopt into Code]
    F -- No --> H[Reconcile Safely]
    I[Reconciliation Window] --> B
    J[Change Freeze] --> B
    K[Exclusion Rules] --> A

Approval Gates for Reconciliation

Instead of having the pipeline automatically apply changes when drift is detected, make it stop at a review stage. Send a notification to the relevant team with details about what changed. Require approval before the reconciliation runs. This gives the team time to check whether the drift should be reverted or adopted into the codebase.

This doesn't have to slow operations down. A well-designed approval step can be fast. The key is that a human makes the call, not an automated system that lacks context.
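
As a rough sketch of that gate, here's a pipeline wrapper in Python. The notify_team hook and the drift.tfplan file name are stand-ins for whatever your team uses; terraform plan's -detailed-exitcode flag genuinely exits with 2 when the plan contains changes:

import subprocess
import sys

def detect_drift() -> bool:
    """terraform plan with -detailed-exitcode returns 2 when changes exist."""
    result = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-out=drift.tfplan"]
    )
    if result.returncode == 1:
        raise RuntimeError("terraform plan itself failed")
    return result.returncode == 2

def notify_team(message: str) -> None:
    # Hypothetical hook: swap in Slack, PagerDuty, email, whatever you use.
    print(f"[drift-alert] {message}")

if __name__ == "__main__":
    if detect_drift():
        notify_team("Drift detected. Review drift.tfplan and approve before applying.")
        sys.exit(2)  # stop here; the apply stage runs only after human approval
    print("No drift detected.")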

Reconciliation Windows

Define specific time windows when automated reconciliation is allowed to run. For example, only between 9 AM and 5 PM on weekdays. Outside those hours, drift is detected and reported, but not automatically fixed.

This simple rule prevents the 2 AM incident scenario. If an emergency change happens at night, the pipeline will log it and alert the team, but won't undo the fix until morning when someone can review it properly.
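
A minimal version of that check might look like the following, using the 9-to-5 weekday window from above and local time; you'd substitute your team's actual coverage hours:

from datetime import datetime

def in_reconciliation_window(now: datetime | None = None) -> bool:
    """Allow automated reconciliation only on weekdays, 9 AM to 5 PM local time."""
    now = now or datetime.now()
    is_weekday = now.weekday() < 5  # Monday=0 through Friday=4
    in_hours = 9 <= now.hour < 17   # 09:00 up to, not including, 17:00
    return is_weekday and in_hours

if in_reconciliation_window():
    print("Inside window: reconciliation may proceed.")
else:
    print("Outside window: log and alert on drift, but do not auto-apply.")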

Change Freezes

Freezes happen before major releases, during audits, or during critical migrations. When the team is in a freeze period, turn off all automated reconciliation: drift is still monitored and logged, but no automatic changes are allowed. The team can re-enable reconciliation after the freeze ends and after confirming that all legitimate changes have been recorded in code.
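
One lightweight way to wire this in is a freeze flag the reconciliation job checks before doing anything. The committed FREEZE file and CHANGE_FREEZE variable below are conventions invented for this sketch:

import os
from pathlib import Path

def freeze_active(repo_root: str = ".") -> bool:
    """A freeze is on if a FREEZE file is committed or CHANGE_FREEZE=1 is set.
    Both the file name and the variable are conventions this sketch invents."""
    flag_file = Path(repo_root) / "FREEZE"
    return flag_file.exists() or os.environ.get("CHANGE_FREEZE") == "1"

if freeze_active():
    print("Change freeze active: logging drift, skipping reconciliation.")
else:
    print("No freeze: normal reconciliation rules apply.")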

Exclusion Rules for Dynamic Resources

Some resources change frequently outside the pipeline by design. Auto-scaling groups adjust size based on load. Monitoring tools automatically tune configurations. These resources should be excluded from automated reconciliation or given special rules that allow certain types of drift.

For example, in Terraform you can use the lifecycle block with ignore_changes to prevent the pipeline from reverting legitimate dynamic adjustments:

resource "aws_autoscaling_group" "app" {
  name               = "production-app-asg"
  min_size           = 2
  max_size           = 10
  desired_capacity   = 4
  launch_configuration = aws_launch_configuration.app.id
  vpc_zone_identifier = ["subnet-abc123", "subnet-def456"]

  lifecycle {
    ignore_changes = [
      desired_capacity,
      min_size,
      max_size,
    ]
  }
}

This tells Terraform to ignore changes to the scaling parameters, so manual scaling during load spikes won't be reverted.

This isn't about making exceptions to the rule. It's about recognizing that some infrastructure is inherently dynamic, and treating its normal operational changes as drift creates more problems than it solves.

A Practical Checklist

Before enabling automated reconciliation for any resource group, verify these points:

  • Can the team override or pause reconciliation during incidents?
  • Is there a defined reconciliation window that excludes off-hours?
  • Are dynamic resources like auto-scaling groups excluded or given special rules?
  • Does the pipeline require human approval before applying reconciliation changes?
  • Is there a documented process for adopting legitimate drift back into code?
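
Taken together, the controls reduce to a single guard in the reconciliation job. This sketch reuses in_reconciliation_window() and freeze_active() from the earlier sketches; EXCLUDED_GROUPS and approved_by_human() are illustrative placeholders:

# Assumes in_reconciliation_window() and freeze_active() from above are in scope.
EXCLUDED_GROUPS = {"app-autoscaling"}  # hypothetical names for dynamic resources

def approved_by_human(resource_group: str) -> bool:
    """Placeholder: in practice, query your CI system's manual-approval state."""
    return False

def may_auto_reconcile(resource_group: str) -> bool:
    """Auto-apply only when every control allows it; otherwise report and wait."""
    if freeze_active():                     # a change freeze overrides everything
        return False
    if not in_reconciliation_window():      # off-hours drift waits for review
        return False
    if resource_group in EXCLUDED_GROUPS:   # dynamic resources are never auto-fixed
        return False
    return approved_by_human(resource_group)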

The Real Takeaway

Automated reconciliation is a tool, not a policy. It works well when your infrastructure is stable, your changes all go through the pipeline, and incidents are rare. It works against you when operations are messy, emergencies happen, and humans need to make judgment calls.

The teams that handle this well don't automate everything. They automate the detection and notification of drift, but keep the decision to reconcile in human hands. They build windows and freezes that match their operational reality. They exclude resources that are supposed to be dynamic.

Your infrastructure will drift from your code. Some of that drift will be a problem. Some of it will be the reason your service stayed up. The goal isn't to eliminate drift. The goal is to have enough control to know which kind you're dealing with before you act.