When Infrastructure Is the Product: IaC Governance and Drift Detection

Imagine you're responsible for hundreds or thousands of servers spread across multiple cloud providers and regions. Your company might be a cloud service provider, a large e-commerce platform, or a tech company with infrastructure deployed in Singapore, Frankfurt, and Virginia simultaneously.

In this environment, infrastructure isn't just where applications run. Infrastructure is the product. Every configuration change--adding a firewall rule, resizing a database instance, tweaking a load balancer parameter--can affect dozens of services at once. One wrong configuration doesn't break one application. It breaks hundreds of services that depend on that shared infrastructure.

This is a different world from deploying a single web application. The stakes are higher, the blast radius is wider, and the margin for error is near zero.

The Consistency Problem That Leads to IaC

When infrastructure is managed manually through SSH sessions or cloud dashboards, inconsistencies creep in fast. Your staging environment has slightly different firewall rules than production. Your production setup in Asia differs from Europe. Nobody can say with confidence what configuration is actually running right now.

This is the moment where teams reach for Infrastructure as Code (IaC). You write all infrastructure configuration as code, store it in Git, and apply it automatically through pipelines. Terraform, Pulumi, AWS CDK, or CloudFormation become your primary tools. Every server, network rule, and storage bucket is defined in version control.

But writing configuration as code is only the first step. Once your infrastructure lives in code, a new question emerges: "How do we make sure every change follows our policies before it reaches production?"

IaC Governance: Automated Policy, Not Bureaucracy

Governance sounds like bureaucracy that slows things down. In practice, IaC governance is the opposite. It's automated guardrails that check every change before it reaches production, without requiring a human to read through every line of configuration.

Here's how it works in practice. Your security team decides that all storage buckets must be encrypted. Your compliance team mandates that all database instances use a specific instance type. Your networking team requires that no firewall rule opens port 22 or 3306 to the public internet.

The following diagram illustrates the automated governance pipeline from code commit through deployment and drift detection:

flowchart TD
  A[Commit IaC Code] --> B[Run Policy Checks]
  B --> C{Policy Pass?}
  C -->|Yes| D[Deploy to Infrastructure]
  C -->|No| E[Block Change & Notify]
  E --> A
  D --> F[Monitor for Drift]
  F --> G{Drift Detected?}
  G -->|No| H[Infrastructure Stable]
  G -->|Yes| I[Alert Team]
  I --> J{Auto-Remediate?}
  J -->|Yes| K[Reconcile to Code State]
  J -->|No| L[Manual Review]
  K --> F
  L --> M[Update Code or Accept Drift]
  M --> A

These rules get written as automated policies that run inside your CI/CD pipeline. When someone submits a pull request that changes infrastructure code, the pipeline doesn't just apply the change. It first checks every resource against your policies. If a change violates a policy, the pipeline fails. The change never reaches production.

For example, here's a simple Open Policy Agent (OPA) rule, written in Rego and evaluated against Terraform's JSON plan representation, that enforces a tagging standard by requiring every resource to have a cost-center tag:

package terraform

import future.keywords.in

# Deny any resource that doesn't have a 'cost-center' tag
violation[msg] {
  resource := input.resource_changes[_]
  resource.type in ["aws_s3_bucket", "aws_instance", "aws_db_instance"]
  # Hyphenated keys need bracket notation in Rego, not dot access
  not resource.change.after.tags["cost-center"]
  msg := sprintf("%v %v is missing required tag 'cost-center'", [resource.type, resource.address])
}

When this policy runs in your CI/CD pipeline (typically wired in with a tool like conftest, or with opa eval against the plan's JSON output), any resource change that lacks the required tag will cause the pipeline to fail, preventing the misconfigured infrastructure from ever reaching production.
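
Policies are code, so they can be tested like code. Here's a minimal unit-test sketch for the tagging rule, runnable with opa test; the plan fragment and resource name are made up for illustration:

package terraform

# A plan fragment with an untagged bucket must trigger the tagging rule
test_untagged_bucket_is_denied {
  plan := {"resource_changes": [{
    "address": "aws_s3_bucket.logs",
    "type": "aws_s3_bucket",
    "change": {"after": {"tags": {}}}
  }]}
  count(violation) > 0 with input as plan
}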

This isn't about adding approval gates for the sake of control. It's about catching problems before they become incidents. A misconfigured storage bucket that's publicly readable doesn't become a data breach. A database instance that's too small doesn't cause a performance outage. The policy catches it in the pipeline, and the team fixes it before it ever runs.
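
The networking team's rule from earlier translates the same way. A minimal sketch, assuming ingress rules are declared as standalone aws_security_group_rule resources; adjust the types and ports to your own environment:

package terraform

import future.keywords.in

# Deny ingress rules that expose SSH (22) or MySQL (3306) to the internet
violation[msg] {
  rc := input.resource_changes[_]
  rc.type == "aws_security_group_rule"
  rc.change.after.type == "ingress"
  rc.change.after.from_port in [22, 3306]
  "0.0.0.0/0" in rc.change.after.cidr_blocks
  msg := sprintf("%v opens port %v to the public internet", [rc.address, rc.change.after.from_port])
}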

Drift: When Reality Diverges from Code

IaC gives you a single source of truth for your infrastructure. But that truth only holds if what's actually running matches what's in your code. In practice, this alignment breaks all the time.

Drift happens when the real-world infrastructure configuration differs from what's defined in your IaC code. Someone logs into the cloud dashboard during an incident and changes a security group rule manually. A team member adds a load balancer listener directly through the console because they needed it urgently. An automated process from another team modifies a resource without going through your pipeline.

Drift is dangerous because it makes your IaC untrustworthy. If your code says a server has configuration A, but reality has configuration B, then recovery, scaling, or creating a new environment will produce unexpected results. You can't rebuild infrastructure from code if the code doesn't match reality.

Drift detection solves this by periodically comparing your actual infrastructure state against your code. The pipeline runs the same commands it uses to apply infrastructure, but in "plan" or "preview" mode. It doesn't change anything. It just compares. When it finds differences, it reports them.
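
With Terraform, for instance, terraform plan -detailed-exitcode exits with status 2 whenever pending changes exist, and rendering a saved plan with terraform show -json produces the same resource_changes structure the policies above consume. Here's a sketch of a rule that turns that output into a drift report:

package drift

# Any planned action other than a no-op means the live resource
# no longer matches what the code describes
drifted[msg] {
  rc := input.resource_changes[_]
  rc.change.actions != ["no-op"]
  msg := sprintf("%v has drifted: planned actions %v", [rc.address, rc.change.actions])
}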

Some teams take this further. When drift is detected, the pipeline can automatically reconcile the infrastructure back to the code-defined state. Others prefer to notify the responsible team and let them decide whether to update the code or accept the drift as intentional.
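
The same plan data can drive that decision. Extending the drift sketch above, one plausible split is to reconcile non-destructive drift automatically and route anything that would destroy a resource to a human:

package drift

import future.keywords.in

# Drift that would destroy a resource is never reconciled automatically
requires_manual_review[msg] {
  rc := input.resource_changes[_]
  "delete" in rc.change.actions
  msg := sprintf("%v would be destroyed by reconciliation; review manually", [rc.address])
}

# Treat all other drifted resources as candidates for automatic reapply
safe_to_remediate[rc.address] {
  rc := input.resource_changes[_]
  rc.change.actions != ["no-op"]
  not "delete" in rc.change.actions
}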

Handling High-Risk Changes

Some infrastructure changes carry more risk than others. Changing a production database configuration, modifying a firewall rule that affects primary traffic, or altering a load balancer that serves millions of requests--these need extra care.

For these changes, you need more than automated policies. You need integrated approval processes that happen inside the pipeline, not outside it. The workflow looks like this:

  1. A developer creates a pull request with the infrastructure change.
  2. Automated policies run and check for violations.
  3. If policies pass, the pipeline notifies the relevant reviewers--maybe a DBA for database changes, or a network engineer for firewall changes (a sketch of this routing follows the list).
  4. Reviewers approve or request changes directly in the pull request.
  5. Only after all required approvals are granted does the change get applied.
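
Step 3 can itself be policy-driven. Here's a sketch of a routing rule; the high-risk resource types and team names are illustrative placeholders:

package approvals

# Map high-risk resource types to the team that must sign off
required_reviewer := {
  "aws_db_instance": "dba-team",
  "aws_security_group_rule": "network-team",
  "aws_lb": "network-team"
}

# Any real change to a high-risk type needs an explicit approval
needs_approval[msg] {
  rc := input.resource_changes[_]
  reviewer := required_reviewer[rc.type]
  rc.change.actions != ["no-op"]
  msg := sprintf("%v requires approval from %v", [rc.address, reviewer])
}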

The key insight is that approval doesn't mean slowing down. It means having the right people review the right changes at the right time, with all the context available in the pipeline. No chasing people on chat. No last-minute approvals before a release window closes.

Practical Checklist for IaC Governance

  • Write automated policies for every security and compliance rule that applies to your infrastructure
  • Run policy checks in your pipeline before any infrastructure change is applied
  • Set up drift detection to run on a schedule (daily or weekly depending on your risk tolerance)
  • Decide whether to auto-remediate drift or notify the team
  • Define which changes require additional approval and who the approvers are
  • Make approval part of the pipeline, not a separate process outside it

The Takeaway

When infrastructure is your product, consistency and control aren't optional. IaC gives you the foundation, but governance and drift detection turn that foundation into something you can trust. Automated policies catch problems before they reach production. Drift detection tells you when reality has diverged from your code. And integrated approvals ensure that high-risk changes get the right attention without becoming bottlenecks.

The goal isn't to slow down infrastructure changes. It's to make them safe enough that you can move fast without fear.