When Your Infrastructure State Doesn't Match Reality

You set up your infrastructure as code. Terraform, Pulumi, or whatever tool you chose. Everything is tracked, versioned, and repeatable. Your state file says the production server has 4 CPUs and 16 GB of RAM. Life is good.

Then one day, someone logs into the cloud console and resizes that server manually. Maybe there was a performance issue, and the support team needed to act fast. Maybe someone ran a script that wasn't in your codebase. Whatever the reason, your state file still says 4 CPUs and 16 GB, but the actual server is now running with 8 CPUs and 32 GB.

That gap between what your state says and what actually exists is called drift. And it's a bigger problem than most teams realize.

Why Drift Happens

Drift isn't rare, and it usually happens for understandable reasons:

  • Someone makes a quick change in the cloud console during an incident
  • A different team runs their own automation that touches your resources
  • A monitoring tool auto-scales something without updating the state
  • A developer modifies a resource directly to test something and forgets to revert

The intent is almost never malicious. But the result is the same: your state becomes unreliable. And when state is unreliable, every subsequent deployment becomes a gamble. You might try to update a resource that no longer matches your configuration. Or you might discover that resources you thought existed have been modified or deleted. The next time you run your pipeline, you get surprises instead of predictable outcomes.

The Real Cost of Drift

Drift doesn't just break your automation. It breaks trust in your entire delivery process. When your team can't trust that infrastructure as code actually reflects reality, they start making manual changes again. And manual changes lead to more drift. It's a downward spiral.

For production environments, drift is especially dangerous. A resource that was modified outside your process might behave differently under load. Security groups that were changed manually might leave gaps. Database instances that were resized without updating state might cause unexpected costs or performance issues. And when something goes wrong, you have no reliable record of what actually changed.

Detecting Drift: The Simple Way

The most basic approach to drift detection is manual comparison. With Terraform, running terraform plan refreshes the state from your provider and shows the difference between your configuration and the actual infrastructure. Any resource that was changed outside your code will appear as an unexpected modification.

Here is the command to run and what to look for:

# Run terraform plan with no code changes to detect drift
terraform plan

# Example output showing drift (no code changes were made)
# Terraform will perform the following actions:
#
#   # aws_instance.web_server will be updated in-place
#   ~ resource "aws_instance" "web_server" {
#       ~ instance_type = "t3.large" -> "t3.medium"
#         id            = "i-0abcd1234efgh5678"
#         tags          = {}
#         # (12 unchanged attributes hidden)
#     }
#
# Plan: 0 to add, 1 to change, 0 to destroy.

# Any changes shown when you haven't modified your code = drift
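
Terraform 0.15.4 and later also support a refresh-only plan, which compares your state against the real infrastructure without mixing in any pending configuration changes, so the output shows drift and nothing else:

# Show only drift: compare state against real infrastructure
terraform plan -refresh-only

# For scripting, -detailed-exitcode returns 0 (no changes),
# 1 (error), or 2 (the plan is non-empty, i.e. drift is present)
terraform plan -refresh-only -detailed-exitcode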

This works for occasional checks. But manual detection has a fundamental problem: you only find drift when you look for it. If you check once a week, drift can exist for days before you notice. And in that time, it can cause real problems.

Automating Drift Detection

For environments that need consistent control, drift detection should run automatically. Many teams set up scheduled pipelines that run terraform plan or an equivalent command, and when drift is detected, the pipeline sends a notification to the team.

Some tools have this built in. Terraform Cloud offers scheduled drift detection, and dedicated tools like driftctl exist for exactly this job. Pulumi has similar capabilities. But even without these tools, you can set up a simple cron job or scheduled CI pipeline that runs your infrastructure validation and alerts when things don't match.
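
Such a script doesn't need to be elaborate. Here is a minimal sketch of a scheduled check, assuming a generic chat webhook; the working directory and the DRIFT_WEBHOOK_URL variable are placeholders you would adapt to your own setup:

#!/usr/bin/env bash
# drift-check.sh -- run from cron or a scheduled CI job.
# The working directory and DRIFT_WEBHOOK_URL are placeholders.
set -euo pipefail

cd /opt/infra/production   # hypothetical path to the production Terraform config
terraform init -input=false -no-color > /dev/null

# -detailed-exitcode: 0 = no changes, 1 = error, 2 = changes detected
set +e
terraform plan -refresh-only -detailed-exitcode -input=false -no-color > plan.txt 2>&1
status=$?
set -e

if [ "$status" -eq 2 ]; then
  # Drift found: post a notification (generic incoming-webhook format;
  # adapt the payload to your chat tool)
  curl -s -X POST -H 'Content-Type: application/json' \
    --data '{"text": "Drift detected in production. See the latest drift-check output."}' \
    "${DRIFT_WEBHOOK_URL:?set to your chat webhook URL}"
elif [ "$status" -eq 1 ]; then
  echo "terraform plan failed; see plan.txt" >&2
  exit 1
fi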

Automated detection is especially important for production environments. Drift in production doesn't wait for your weekly check. It affects users immediately.

What To Do When You Find Drift

Detection is only half the problem. Once you know drift exists, you need to decide what to do about it. You have two main options, and the following flowchart summarizes both paths:

flowchart TD
    A[Drift Detected] --> B{Was the change intentional?}
    B -->|No| C[Reconcile to code]
    B -->|Yes| D{Should it become the new standard?}
    D -->|No| C
    D -->|Yes| E[Update state to match reality]
    C --> F[Run apply to restore desired state]
    E --> G[Import resource & update code]
    F --> H[Verify no remaining drift]
    G --> H
    H --> I[Document the change]

Option 1: Reconcile back to your code. Apply your existing configuration again to bring the infrastructure back to the desired state. This is the safest choice for production environments. It reinforces that your code is the source of truth, and manual changes won't persist.
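
In Terraform terms, reconciling means re-applying the configuration you already have. A minimal example:

# No code changes needed: re-apply the existing configuration
terraform plan    # review what will be reverted
terraform apply   # restore the infrastructure to the coded desired state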

Option 2: Update your state to match reality. Import the current infrastructure into your state, then update your code to match. This makes sense when the manual change was intentional and should become the new standard. But be careful: accepting drift into your state means you're accepting that infrastructure can be changed outside your defined process.
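
What this looks like in Terraform depends on whether the resource is already tracked in state. A sketch of both cases, reusing the instance from the earlier plan output:

# Case 1: the resource is in state but its attributes drifted.
# Edit the code to match reality first (e.g. instance_type = "t3.large"),
# then persist the refreshed values into state:
terraform apply -refresh-only

# This should now report no changes:
terraform plan

# Case 2: the resource was created entirely outside your code.
# Write a matching resource block, then import it:
terraform import aws_instance.web_server i-0abcd1234efgh5678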

Most mature teams choose option one for production. They reconcile the infrastructure back to the desired state. This practice is called reconciliation, and it's the core idea behind tools like Kubernetes operators and GitOps workflows. The system continuously checks that reality matches the desired state, and automatically corrects any drift it finds.

Building a Drift Detection Practice

If you're setting up drift detection for the first time, start simple. Run a scheduled plan against your most critical environments. Send the results to a chat channel where the team can see them. Make drift visible before you try to automate the response.
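
If the scheduled check is a script like the one sketched earlier, a single crontab entry is enough to start; the path and schedule here are illustrative:

# Run the drift check every weekday at 06:00 and keep a log
0 6 * * 1-5 /usr/local/bin/drift-check.sh >> /var/log/drift-check.log 2>&1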

Once the team is used to seeing drift notifications, start automating the response for non-production environments. Let the pipeline automatically reconcile staging and development environments. For production, keep the human in the loop until you're confident in your automation.
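
Automating the response can be a small extension of the detection script: if the plan is non-empty, apply it. A sketch for a staging environment, again with placeholder paths, and deliberately not pointed at production:

#!/usr/bin/env bash
# auto-reconcile.sh -- automatic drift repair for non-production only.
set -euo pipefail

cd /opt/infra/staging   # hypothetical path to the staging Terraform config
terraform init -input=false > /dev/null

# A normal (not refresh-only) plan, saved to a file so we apply exactly
# what was computed. Exit code 2 means the plan contains changes.
set +e
terraform plan -detailed-exitcode -input=false -out=drift.tfplan
status=$?
set -e

if [ "$status" -eq 2 ]; then
  # Applying a saved plan file requires no interactive approval
  terraform apply -input=false drift.tfplan
elif [ "$status" -eq 1 ]; then
  exit 1
fi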

And always keep this in mind: drift detection is not just about catching mistakes. It's about maintaining trust in your infrastructure as code. When your team knows that state is accurate, they can make changes with confidence. When they don't, everything slows down.

Practical Checklist

  • Run terraform plan or equivalent on a schedule for production environments
  • Send drift notifications to a team channel
  • Define a clear policy: reconcile to code or update state
  • Automate reconciliation for non-production environments first
  • Document how to handle intentional manual changes
  • Review drift patterns monthly to identify process gaps

The Takeaway

Drift is not a failure of your tools. It's a signal that your process has a gap. Someone needed to make a change, and the defined process didn't work for them. Maybe it was too slow. Maybe they didn't have access. Maybe they didn't know the process existed.

When you find drift, don't just fix the infrastructure. Fix the process that allowed the drift to happen. Make it easier for people to make changes through the right path than around it. That's how you build a system that stays consistent without constant vigilance.