When Infrastructure Changes Outside Your Pipeline: Drift, Policy, and Practical Governance
Imagine this: you're on call at 2 AM. A production incident is unfolding, and someone on the team needs to open a security group port temporarily to debug a connectivity issue. They log into the cloud console, make the change, and the incident gets resolved. Everyone breathes a sigh of relief.
Three weeks later, during a routine security audit, you discover that port is still open to the entire internet. Nobody remembered to close it. Nobody documented the change. And your infrastructure-as-code pipeline still thinks that security group is locked down.
This is infrastructure drift. It happens in every organization that runs real systems. The question isn't whether it will happen, but how you handle it when it does.
Why Banning Manual Changes Doesn't Work
The obvious answer seems to be: just forbid all manual changes. Everything must go through the pipeline. No exceptions.
In practice, this approach fails for three reasons.
First, emergencies happen. When a production system is down, waiting for a pipeline run that takes 15 minutes is not acceptable. Engineers will find a way to make the change directly, and they should be able to.
Second, not all changes are equal. Adding a tag to a resource or updating a description is fundamentally different from modifying a database security group. Treating them the same creates unnecessary friction for low-risk changes.
Third, enforcement is hard. You can't physically prevent someone with cloud console access from clicking a button. And if you remove console access entirely, you create bottlenecks for legitimate troubleshooting.
The better approach is not to ban manual changes, but to govern them.
Policy as Code: Rules That Actually Enforce Themselves
Most organizations have policies about infrastructure changes. They're usually written in a document somewhere: "All changes to production must go through the change management process." But documents don't enforce anything. They just sit there.
Policy as code changes this. Tools like Open Policy Agent (OPA) or HashiCorp Sentinel let you write rules in code and attach them to enforcement points. When someone tries to modify a resource through the cloud console, or when an API call comes in directly, the policy engine evaluates the request against your rules before allowing it.
Here's a concrete example. You write a policy that says: "No security group may have port 22 open to 0.0.0.0/0 unless the change comes through the approved pipeline." When an engineer, panicking during an incident, tries to open SSH to the world from the AWS console, the policy blocks the change. Or at minimum, it logs the attempt and requires additional approval.
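The decision logic behind such a rule can be sketched in a few lines. This is an illustrative Python sketch, not OPA or Sentinel syntax; the request shape and the "source" field indicating where the change originated are assumptions for the example:

```python
WORLD = "0.0.0.0/0"

def evaluate_sg_change(request: dict) -> str:
    """Return 'allow' or 'needs_approval' for a security-group change."""
    for rule in request.get("ingress_rules", []):
        if rule.get("port") == 22 and WORLD in rule.get("cidr_blocks", []):
            # Port 22 open to the internet: only the pipeline may do this.
            if request.get("source") == "pipeline":
                return "allow"
            # Manual console/API change: block or escalate for approval.
            return "needs_approval"
    return "allow"

# A panicked console change opening SSH to the world is stopped:
console_change = {
    "source": "console",
    "ingress_rules": [{"port": 22, "cidr_blocks": ["0.0.0.0/0"]}],
}
print(evaluate_sg_change(console_change))  # needs_approval
```

In a real deployment, this evaluation runs inside the policy engine at the enforcement point, not in application code; the sketch only shows the shape of the decision.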
For instance, here is a HashiCorp Sentinel policy that requires every resource to have a managed-by tag, ensuring changes outside the pipeline are traceable:
# Require all resources to have a "managed-by" tag
import "tfplan/v2" as tfplan

mandatory_tag = "managed-by"

# Consider only resources that will be created or updated
relevant_resources = filter tfplan.resource_changes as _, rc {
    rc.change.actions contains "create" or
        rc.change.actions contains "update"
}

# Rule: every such resource must carry the mandatory tag
main = rule {
    all relevant_resources as _, rc {
        (rc.change.after.tags else {}) contains mandatory_tag
    }
}
This policy integrates into your CI/CD pipeline as a guard. If a resource is deployed without the tag, the pipeline fails, preventing untracked changes from reaching production.
This approach gives you two things. First, clear boundaries that don't rely on human discipline. Second, automatic audit trails. Every policy decision is recorded: who tried to change what, when, from where, and whether it was allowed or denied.
Designing Policies That Don't Break Under Pressure
A common mistake is making policies too rigid. If your policy blocks every manual change without exception, you'll create a situation where engineers find ways to bypass it entirely. Or worse, they'll be afraid to act during real emergencies.
The solution is to design policies with operational reality in mind.
One pattern is the break-glass mechanism. In certain defined emergency situations, an engineer can override a policy. The override is logged with a reason, and after the incident, the team reviews whether the change should be adopted into IaC or rolled back. This gives you safety without paralysis.
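A break-glass log can be very simple. Here is a minimal Python sketch of the idea, assuming an in-memory store; the names (`BreakGlassLog`, `record_override`) are illustrative, not taken from any specific tool:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class Override:
    engineer: str
    resource: str
    reason: str
    timestamp: str
    reviewed: bool = False  # flipped during post-incident review

@dataclass
class BreakGlassLog:
    entries: list = field(default_factory=list)

    def record_override(self, engineer, resource, reason):
        # Every override must carry a reason; refuse silent bypasses.
        if not reason.strip():
            raise ValueError("break-glass override requires a reason")
        entry = Override(engineer, resource, reason,
                         datetime.now(timezone.utc).isoformat())
        self.entries.append(entry)
        return entry

    def pending_review(self):
        # After the incident, the team works through this queue and decides
        # whether each change is adopted into IaC or rolled back.
        return [e for e in self.entries if not e.reviewed]
```

The key design choice is that the override path is easy to use but impossible to use silently: a reason is mandatory at the moment of the override, and the unreviewed queue keeps the change visible until someone closes it out.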
Another pattern is classifying changes by risk. Low-risk changes like adding tags, updating descriptions, or modifying non-critical configuration can be allowed freely. High-risk changes like modifying network access, security policies, or database configurations must go through the pipeline. The policy engine enforces this distinction automatically, so you don't need a human to decide each time.
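The risk classification itself can be a small, auditable function. A Python sketch of the pattern, where the high-risk resource-type prefixes are assumptions and a real policy would match on your own resource taxonomy:

```python
# Resource types whose changes are always high risk (illustrative list).
HIGH_RISK_PREFIXES = ("aws_security_group", "aws_db_instance", "aws_iam")

def classify_change(resource_type: str, changed_attrs: set) -> str:
    """Return 'high' or 'low' risk for a proposed change."""
    if resource_type.startswith(HIGH_RISK_PREFIXES):
        return "high"
    # Tag or description tweaks on other resources pass freely.
    if changed_attrs <= {"tags", "description"}:
        return "low"
    # Anything unrecognized defaults to the stricter path.
    return "high"

print(classify_change("aws_s3_bucket", {"tags"}))        # low
print(classify_change("aws_security_group", {"tags"}))   # high
```

Note the default: when the classifier isn't sure, it escalates. Defaulting to "low" would quietly exempt every resource type you forgot to list.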
Connecting Policy to Drift Detection
Policy and drift detection work best when they're connected. Here's how the flow looks in practice.
Your drift scanner runs regularly and finds differences between your IaC definitions and the actual infrastructure. Instead of flagging every difference as a problem, the system checks each difference against your policies.
Together, policy enforcement and drift detection form a continuous governance loop: the policy engine records decisions, and the drift scanner checks each detected difference against them.
If the change was made through an approved policy exception, it's marked as "known and allowed." If the change violates a policy, it's flagged for immediate remediation. If the change doesn't match any policy rule, it goes into a queue for manual review.
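The triage described above reduces to routing each drift item into one of three buckets. A Python sketch, where the drift-item shape and the `is_allowed_exception` / `violates_policy` predicates are assumptions standing in for calls to your policy engine:

```python
def triage_drift(drift_items, is_allowed_exception, violates_policy):
    """Route each detected drift item into one of three buckets."""
    buckets = {"known_allowed": [], "remediate": [], "manual_review": []}
    for item in drift_items:
        if is_allowed_exception(item):
            buckets["known_allowed"].append(item)   # approved exception
        elif violates_policy(item):
            buckets["remediate"].append(item)       # flag for immediate fix
        else:
            buckets["manual_review"].append(item)   # unmatched: human decides
    return buckets
```

Order matters here: an approved exception is checked first, so a break-glass change that technically violates a baseline rule is still recognized as known and allowed rather than auto-remediated out from under the responders.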
This changes how your team sees drift. Drift is no longer just a mistake to fix. It's a signal to interpret. Is this a legitimate emergency change that hasn't been adopted into code yet? Is this a policy violation? Is this a change that should have gone through the pipeline but didn't?
With policy and governance in place, you can distinguish these cases without investigating each one manually. And because policies are written as code, they can be reviewed, versioned, and tested like any other code. They're not dusty documents in a shared folder.
Practical Checklist for Policy-Driven Governance
If you're setting up policy and governance for infrastructure changes, here's a short checklist to work through:
- Identify the types of changes that are most risky in your environment (security groups, database configs, IAM roles)
- Write policies that block those changes outside the pipeline, with clear exception paths
- Implement a break-glass mechanism for emergencies, with mandatory post-incident review
- Connect your policy engine to your drift detection tooling
- Classify changes by risk level and apply different rules accordingly
- Set up automatic audit logging for every policy decision
- Review policy exceptions regularly and adopt them into IaC or roll them back
The Takeaway
Infrastructure drift is not a problem you solve once. It's a condition you manage continuously. The goal is not to eliminate all manual changes, but to know about them, govern them, and decide which ones should become permanent parts of your infrastructure.
Policy as code gives you a way to enforce boundaries without blocking legitimate work. Drift detection gives you visibility into what's actually happening. Together, they turn infrastructure management from a reactive firefight into a system you can trust, even when things go wrong at 2 AM.