When Infrastructure Changes Outside Your Pipeline: A Drift Detection Exercise

You have a Terraform configuration that defines a security group. It has the right name, the correct inbound rules, and the proper tags. Your pipeline ran successfully, the resource was created, and your state file is clean. Everything looks good.

Then someone logs into the cloud console and makes a small change. Maybe they rename the security group because it was confusing. Maybe they add an inbound rule to test something quickly. Maybe they remove a tag that seemed unnecessary. No code changes. No pipeline run. Just a manual tweak in the console.

Your infrastructure is now different from what your code says it should be. That difference is called drift. And if you don't know it happened, your next deployment could break things in ways you didn't expect.

What Drift Actually Looks Like

Drift happens when the actual state of your infrastructure diverges from the desired state defined in your code. It's not a theoretical problem. It happens all the time in real teams:

Someone fixes an urgent issue directly in production because the pipeline would take too long.
A cloud provider automatically rotates a certificate or changes a default setting.
A team member deletes a resource by accident while cleaning up something else.
An automated policy outside your pipeline modifies a resource for compliance reasons.

The problem isn't that drift exists. The problem is that you don't know about it until something breaks.

A Simple Exercise to See Drift in Action

You can simulate drift in your own environment with minimal setup. You need a cloud account with free tier resources, or you can use a local simulator like LocalStack. Even a mock state file will work for learning purposes.

Start by creating one resource with Terraform. A security group in AWS or a storage bucket in any cloud provider works well. Run your pipeline until the resource is created and the state file is saved. Make sure you can see the resource in the cloud console.

Here is a minimal Terraform configuration you can use to follow along:

# main.tf
provider "aws" {
  region = "us-east-1"
}

resource "aws_security_group" "web_sg" {
  name        = "web-server-sg"
  description = "Allow HTTP and SSH traffic"

  ingress {
    from_port   = 80
    to_port     = 80
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }

  ingress {
    from_port   = 22
    to_port     = 22
    protocol    = "tcp"
    cidr_blocks = ["10.0.0.0/8"]
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }

  tags = {
    Name = "web-server-sg"
    Env  = "test"
  }
}

Run terraform init and terraform apply to create the security group. AWS does not let you rename a security group after it is created, so change what the console does allow: edit the Name tag to web-server-sg-manual and remove the Env tag. Then run terraform plan to see the drift. The output should look roughly like this (the exact layout depends on your Terraform and provider versions, and recent AWS provider versions also list a matching tags_all change):

$ terraform plan
aws_security_group.web_sg: Refreshing state... [id=sg-0123456789abcdef0]

Terraform will perform the following actions:

  # aws_security_group.web_sg will be updated in-place
  ~ resource "aws_security_group" "web_sg" {
        id          = "sg-0123456789abcdef0"
        name        = "web-server-sg"
      ~ tags        = {
          + "Env"  = "test"
          ~ "Name" = "web-server-sg-manual" -> "web-server-sg"
        }
        # (6 unchanged attributes hidden)
    }

Plan: 0 to add, 1 to change, 0 to destroy.

The plan shows that Terraform will restore the Name tag and add back the missing Env tag. This is drift in action.

You can repeat the experiment with other manual changes. Without touching your Terraform code, open the cloud console and try one of these:

Change the security group's Name tag (AWS does not allow renaming the group itself).
Add an inbound rule that doesn't exist in your code.
Remove a tag that your code defines.
Change the public access setting, if you created a storage bucket instead of a security group.

The goal is to create a situation where the actual resource no longer matches your code. This is drift.

The following flowchart illustrates the sequence of events in this drift detection exercise:

flowchart TD
    A[Desired State: Terraform Config] --> B[Pipeline Creates Resource]
    B --> C[Actual State Matches Config]
    C --> D[Manual Change in Console]
    D --> E[Actual State Drifts from Config]
    E --> F[Run terraform plan]
    F --> G{Plan Shows Changes?}
    G -- Yes --> H[Drift Detected]
    G -- No --> I[No Drift]
    H --> J{Reconciliation Decision}
    J -- Accept Change --> K[Update Terraform Code]
    J -- Restore Config --> L[Run terraform apply]
    K --> M[Pipeline Runs, State Matches]
    L --> M

Running Terraform Plan After Drift

Once you've made the manual change, run terraform plan in your terminal. Look at the output carefully. Terraform first refreshes its view of the actual resource, then compares that against your configuration and shows you what it would change if you ran apply.

Notice something important: the plan might show changes you didn't expect. Maybe it wants to restore the original Name tag. Maybe it wants to remove the rule someone added. But it might also show changes to other resources that depend on the one you modified. A change that forces Terraform to replace the security group, such as a different name or description, gives the group a new ID, and that can affect a load balancer, an instance, or a rule in another security group that references it.

This is why you can't blindly trust a plan after drift has occurred. The plan might be correct, but you need to verify each proposed change. A plan that looks clean on the surface could hide cascading effects that break other parts of your infrastructure.
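One way to make that verification concrete is to save the plan to a file and review the rendered copy before anything is applied. The sketch below assumes Terraform is on your PATH; the function name and the default file name drift.tfplan are illustrative, not part of the exercise.

```shell
# Save the proposed changes to a file and print them for review.
# Assumption: function name and default file name are hypothetical.
save_plan_for_review() {
  local plan_file="${1:-drift.tfplan}"

  # -out records the exact set of proposed actions in a plan file.
  terraform plan -input=false -out="$plan_file" || return 1

  # Render the saved plan in human-readable form so each change can be checked.
  terraform show -no-color "$plan_file"
}
```

Calling save_plan_for_review with no arguments writes drift.tfplan in the current directory. Because the file pins the proposed actions, what you review is what a later apply of that file would perform.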

Detecting Drift Explicitly

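Terraform can also update your state file with the actual state of your resources without proposing any changes to them. The original command for this, terraform refresh, is deprecated. Since Terraform v0.15.4 the recommended replacement is terraform apply -refresh-only, which shows what changed outside Terraform and asks for approval before writing it to state (terraform plan -refresh-only shows the same report without writing anything). Run terraform plan afterwards and you'll see the same drift, but now your state file reflects reality. This is useful for understanding what changed, but it doesn't fix the drift. It just acknowledges it.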

Some platforms, such as Spacelift or HCP Terraform (formerly Terraform Cloud), have built-in drift detection that runs on a schedule. They can notify you when drift is detected, and some can even trigger automatic reconciliation. But for this exercise, manual detection is enough to understand the mechanics.
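If you want to approximate a scheduled check yourself, a minimal sketch is a function that runs terraform plan with -detailed-exitcode and interprets the result. That flag makes plan exit with 0 when there is nothing to change, 1 on error, and 2 when changes are present. The function name and the default log file name are assumptions for illustration.

```shell
# Report drift using terraform plan's -detailed-exitcode values:
#   0 = no changes, 1 = error, 2 = changes present.
# Assumption: function name and default log file name are hypothetical.
check_drift() {
  local log_file="${1:-drift-plan.log}"
  local rc=0

  # Keep the full plan output in a log so the details can be read later.
  terraform plan -detailed-exitcode -input=false -no-color >"$log_file" 2>&1 || rc=$?

  case "$rc" in
    0) echo "no drift" ;;
    2) echo "drift detected" ;;
    *) echo "plan failed" ;;
  esac
  return "$rc"
}
```

Run from cron or a CI schedule, the return code can drive a notification step. Note that exit code 2 only means drift if the checked-out configuration has already been applied; unapplied code changes produce the same result.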

Write down which resources drifted and what changed. This record will help you think about the next step.
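Keeping that record can be partly automated. The sketch below pulls resource addresses out of plan text and appends them to a dated log. It matches header lines of the kind shown earlier, such as "# aws_security_group.web_sg will be updated in-place". The human-readable plan format is not a stable interface, so treat the pattern as an assumption; both function names and the log file name are hypothetical.

```shell
# Extract resource addresses from human-readable plan text on stdin.
# Assumption: header lines look like "# <address> will be ..." or
# "# <address> must be ...", as in the plan output shown in this article.
list_drifted() {
  sed -nE 's/^ *# ([^ ]+) (will be|must be) .*$/\1/p'
}

# Append one timestamped line per drifted resource to a log file.
record_drift() {
  local log_file="${1:-drift-log.txt}"
  local stamp
  stamp="$(date -u +%Y-%m-%dT%H:%M:%SZ)"

  list_drifted | while IFS= read -r addr; do
    printf '%s %s\n' "$stamp" "$addr"
  done >>"$log_file"
}
```

A typical call would be terraform plan -no-color | record_drift. The log only captures which resources drifted, so add a note about what changed and why while the context is still fresh.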

Making a Reconciliation Decision

Now you have a choice. You know the infrastructure has drifted. You know what changed. What do you do about it?

Ask yourself these questions:

Was the manual change intentional? Did someone make it for a legitimate reason, like fixing an urgent issue?
Is the change still needed? Maybe the emergency is over and the resource should return to its original state.
Is it safe to revert the change? Reverting might break something that now depends on the new configuration.
Does anyone on the team know why the change was made? Is there a record in a ticket or a chat log?

If you decide to revert the change, run terraform apply and review the plan it shows before approving. The resource should return to the state defined in your code. Verify that everything works as expected.
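For the revert path, a cautious variant is to apply a plan file that has already been reviewed rather than letting apply compute a fresh plan. This sketch assumes such a file was produced earlier with terraform plan -out; the function name and the default file name drift.tfplan are hypothetical.

```shell
# Apply a previously saved plan file; refuse to run if the file is missing.
# Assumption: function name and default file name are hypothetical.
apply_reviewed_plan() {
  local plan_file="${1:-drift.tfplan}"

  if [ ! -f "$plan_file" ]; then
    echo "plan file not found: $plan_file" >&2
    return 1
  fi

  # Applying a saved plan performs the actions recorded in the file
  # and does not prompt for approval.
  terraform apply -input=false "$plan_file"
}
```

Since a saved plan is applied without a confirmation prompt, the review has to happen before this step. Terraform also rejects a saved plan as stale if the state has changed since the plan was created, which guards against applying an outdated decision.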

If you decide to adopt the change, update your Terraform code to match the actual state. Then run your pipeline normally. The drift is now resolved because your code and your infrastructure agree again.

Variations to Try

Once you understand the basic scenario, try more complex variations:

Make a temporary change, like increasing instance capacity during a traffic spike. See how drift detection catches it after the spike is over.
Change a resource that has dependencies, like modifying a load balancer configuration that connects to a target group. Watch how the plan shows cascading effects.
Create a change that is hard to revert, like deleting a resource that other resources depend on. See how Terraform handles the dependency chain.

Each variation teaches you something about how drift behaves in real systems. The more complex the scenario, the clearer it becomes that drift detection and reconciliation require careful thought, not just automation.

A Practical Checklist for Drift Management

Before you move on, here is a short checklist to apply in your own environment:

Set up automated drift detection on your critical infrastructure resources.
Define a clear process for handling drift: who gets notified, how decisions are made, and when reconciliation is triggered.
Keep a record of manual changes, even temporary ones, so the team knows why drift exists.
Test your reconciliation process in a non-production environment before applying it to production.
Review drift reports regularly, not just when something breaks.

What This Exercise Teaches You

Drift is not a theoretical concept. It is a real operational problem that every team faces when infrastructure is managed through code. The exercise shows you that drift can happen silently, that it can affect more than just the changed resource, and that reconciliation decisions always depend on context.

You cannot prevent all drift. People will make manual changes. Cloud providers will modify resources. Emergencies will happen. What you can do is detect drift early, understand its impact, and make informed decisions about whether to revert or adopt the change.

The next time someone says "I just made a quick fix in the console," you will know exactly what that means for your infrastructure. And you will have a process ready to handle it.