Your Cloud Infrastructure Is Drifting Away From Your Code. Here's How to Catch It.

You've written your infrastructure as code. You've run terraform plan, reviewed the output, and applied the changes. The resources appear in the cloud console. The team celebrates another successful deployment. The work is done.

But is it?

A week later, someone with admin access logs into a server and tweaks a configuration to fix an urgent issue. The security team adds a firewall rule through the cloud dashboard without telling anyone. Your cloud provider deprecates an API endpoint, and the configuration that worked perfectly last month now silently does nothing useful.

None of these changes touched your repository. None of them went through code review. None of them are recorded anywhere your team can find later. Your infrastructure is now different from what your code describes. You have a problem.

This situation has a name: drift. It means the actual state of your infrastructure no longer matches the desired state defined in your code. Drift is dangerous because it quietly breaks the promise of infrastructure as code. If you need to recreate an environment from scratch, the result will be different. If an incident happens, you cannot trust that running the same code will restore a safe state. You have lost the ability to reproduce your infrastructure reliably.

How Drift Happens in Practice

Drift is not caused by malicious actors or incompetent teams. It happens through normal, well-intentioned actions that bypass the pipeline.

A developer needs to test something quickly and changes a security group rule manually. A database administrator adjusts a parameter to handle a sudden load spike. A platform engineer patches a server directly because the automated patch process takes too long. Each of these actions makes sense in the moment. Each of them creates a gap between what your code says and what is actually running.

The problem compounds over time. One manual change becomes two, then ten. After a few months, the infrastructure running in production bears little resemblance to the code in your repository. The next time someone runs terraform apply expecting a clean state, they get a long list of unexpected changes. Nobody knows which changes are intentional and which are accidents. Trust in the infrastructure code erodes.

Detecting Drift Before It Causes Problems

The solution is not to forbid manual changes. Sometimes you need to act fast, and the pipeline is too slow. The solution is to detect drift automatically and regularly, so you know about it before it causes an incident.

Drift detection is a process that compares the actual state of your infrastructure against the desired state in your code. It runs on a schedule or is triggered by specific events. When it finds differences, it records the details: which resource changed, what part of the configuration is different, and what the correct value should be.

Most infrastructure as code tools support drift detection natively. Terraform's terraform plan refreshes state and shows differences between your configuration and the real infrastructure. Pulumi offers pulumi preview and pulumi refresh. AWS CloudFormation has drift detection built into the service. The key is to run these checks automatically, not just when someone remembers to do it.

Here is a simple script you can run in your CI pipeline to detect drift automatically:

#!/bin/bash
# Run terraform plan and exit with a non-zero code if drift is detected
terraform init -input=false || { echo "terraform init failed."; exit 1; }
# Save the plan so a later job can apply exactly what was reviewed
terraform plan -detailed-exitcode -input=false -out=tfplan
PLAN_EXIT_CODE=$?

if [ $PLAN_EXIT_CODE -eq 2 ]; then
  echo "Drift detected! Infrastructure does not match code."
  # Send alert to Slack or your incident management tool
  curl -X POST -H 'Content-type: application/json' \
    --data '{"text":"Drift detected in production environment. Run terraform apply to reconcile."}' \
    YOUR_SLACK_WEBHOOK_URL
  exit 1
elif [ $PLAN_EXIT_CODE -eq 1 ]; then
  echo "Error running terraform plan."
  exit 1
else
  echo "No drift detected. Infrastructure matches code."
fi

This script uses -detailed-exitcode to distinguish between a clean state (exit code 0), an error (exit code 1), and drift (exit code 2).
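
The same pattern works outside Terraform. If your stacks live in CloudFormation, the service's built-in drift detection does the same job. Here is a minimal sketch using the AWS CLI; the stack name is a placeholder and the polling interval is arbitrary:

#!/bin/bash
# Kick off CloudFormation's built-in drift detection and report the result.
# "production-network" is a placeholder stack name.
STACK_NAME="production-network"

# Drift detection is asynchronous: start a run and capture its ID
DETECTION_ID=$(aws cloudformation detect-stack-drift \
  --stack-name "$STACK_NAME" \
  --query StackDriftDetectionId --output text)

# Poll until the detection run completes
while true; do
  STATUS=$(aws cloudformation describe-stack-drift-detection-status \
    --stack-drift-detection-id "$DETECTION_ID" \
    --query DetectionStatus --output text)
  [ "$STATUS" = "DETECTION_IN_PROGRESS" ] || break
  sleep 5
done

# The overall result is DRIFTED, IN_SYNC, or UNKNOWN;
# fail the job unless the stack is in sync
DRIFT_STATUS=$(aws cloudformation describe-stack-drift-detection-status \
  --stack-drift-detection-id "$DETECTION_ID" \
  --query StackDriftStatus --output text)
echo "Stack $STACK_NAME drift status: $DRIFT_STATUS"
[ "$DRIFT_STATUS" = "IN_SYNC" ] || exit 1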

Building Drift Detection Into Your Pipeline

A practical drift detection pipeline looks like this:

flowchart TD
    A[Code Commit] --> B[Plan]
    B --> C[Apply]
    C --> D{Drift Check}
    D -- Periodic or Event-Triggered --> E{Drift Found?}
    E -- No --> D
    E -- Yes --> F[Alert Team]
    F --> G{Decision}
    G -- Revert --> H[Reconcile]
    G -- Accept --> I[Update Code]
    H --> D
    I --> D

In detail, the pipeline has four steps:

  1. Schedule regular checks. Run drift detection every few hours or daily, depending on how critical your infrastructure is. For production environments, more frequent checks are better. For staging environments, daily checks may be sufficient. (A cron and event-trigger sketch follows this list.)

  2. Trigger on suspicious events. Some cloud providers emit events when resources are modified outside of your pipeline. Use webhooks or event logs to trigger a drift check when these events occur.

  3. Report the results. When drift is detected, send a notification to the team. Use Slack, email, or create an issue in your tracker. The notification should include which resources drifted and what the expected values are.

  4. Let the team decide what to do. Not all drift is bad. A manual change might be legitimate and should be incorporated into the code. Another change might be accidental and needs to be reverted. The team needs to see the drift and make a judgment call.
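
To make the first two steps concrete, here is a minimal sketch. It assumes the drift script above is saved as /usr/local/bin/drift-check.sh (a hypothetical path) and, for the event-triggered variant, an AWS account with a CloudTrail trail enabled; the rule name is illustrative:

# Step 1: a crontab entry (edit with `crontab -e`) that runs the
# drift check every 6 hours; the script path is an assumption
0 */6 * * * /usr/local/bin/drift-check.sh

# Step 2: an EventBridge rule that fires whenever a security group
# rule is changed outside the pipeline (console, CLI, SDK). Attach a
# target of your choice (for example, a Lambda that triggers the CI job).
aws events put-rule \
  --name drift-check-on-manual-sg-change \
  --event-pattern '{
    "source": ["aws.ec2"],
    "detail-type": ["AWS API Call via CloudTrail"],
    "detail": {
      "eventName": ["AuthorizeSecurityGroupIngress", "RevokeSecurityGroupIngress"]
    }
  }'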

Automatic Reconciliation: Proceed With Caution

Some teams take drift detection a step further and automate the correction. When the pipeline detects drift, it immediately runs apply to bring the infrastructure back to the desired state. This approach works well for environments that must stay consistent, like staging or tightly controlled production systems.

But automatic reconciliation has risks. Consider auto-scaling: your infrastructure as code defines a minimum and maximum number of instances. An auto-scaling group adds instances based on load. An automatic reconciliation run sees the extra instances as drift and terminates them. Your users experience an outage because the system was doing exactly what it was supposed to do.

The rule of thumb is: only automate reconciliation for resources that should never change outside the pipeline. For everything else, notify the team and let them decide.
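
One way to encode that rule is to gate the apply step on an explicit allowlist of environments. Here is a minimal sketch extending the earlier script; the ENVIRONMENT variable and the choice of which environments auto-reconcile are assumptions to adapt:

#!/bin/bash
# Reconcile automatically only where it is safe; alert everywhere else.
# ENVIRONMENT is assumed to be set by your CI system (e.g., "staging").
terraform plan -detailed-exitcode -input=false -out=tfplan
PLAN_EXIT_CODE=$?

if [ "$PLAN_EXIT_CODE" -eq 2 ]; then
  case "$ENVIRONMENT" in
    staging)
      # Tightly controlled environment: apply the saved plan directly
      terraform apply -input=false tfplan
      ;;
    *)
      # Everywhere else: notify and let a human decide
      echo "Drift detected in $ENVIRONMENT. Manual review required."
      exit 1
      ;;
  esac
fi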

A Practical Drift Detection Checklist

Here is a short checklist to help you get started with drift detection:

  • Choose a schedule for drift checks (every 6 hours for production, daily for staging)
  • Configure notifications to the right channel (Slack, email, or incident management tool)
  • Decide which environments get automatic reconciliation and which get manual review
  • Test the drift detection process by making a small manual change and verifying the alert (a sketch of this test follows the checklist)
  • Review drift reports weekly as a team to identify patterns or recurring issues
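
As a sketch of that verification step, the snippet below introduces a harmless manual change and checks that the alert fires. The security group ID and the drift-check.sh script name are hypothetical, and the rule should be added in a non-production environment:

#!/bin/bash
# Deliberately create drift, confirm the alert fires, then revert.
# sg-0123456789abcdef0 is a hypothetical ID of a security group that
# your Terraform code manages; use a non-production environment.
SG_ID="sg-0123456789abcdef0"

# The "manual change": add an ingress rule outside the pipeline
aws ec2 authorize-security-group-ingress \
  --group-id "$SG_ID" --protocol tcp --port 8080 --cidr 203.0.113.0/24

# Run the drift check (the script from earlier, saved under a
# hypothetical name); it should exit non-zero and send the alert
if ./drift-check.sh; then
  echo "WARNING: no drift detected; the check is not working."
else
  echo "Drift correctly detected and alerted."
fi

# Revert by re-applying the code so reality matches the repository
terraform apply -input=false -auto-approve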

What Drift Detection Gives You

Drift detection does not prevent manual changes. It does not eliminate the need for incident response. What it does is give you visibility into the gap between your code and reality. When you know about drift, you can decide whether to accept it, revert it, or update your code to reflect it.

Without drift detection, your infrastructure as code is a fiction. You might believe your system is reproducible, but you have no way to verify it. With drift detection, your code remains the single source of truth because you actively maintain that truth.

The next time someone says "we use infrastructure as code," ask them: "How do you know your infrastructure still matches your code?" If they cannot answer, they have drift. And drift is a time bomb waiting to go off during the next incident.