When Infrastructure Changes Outside Your Pipeline: Drift Detection for Policy Compliance
You have written policies. You have automated checks in your CI/CD pipeline. Every deployment runs through validation before anything reaches production. Your team feels confident that infrastructure follows the rules.
Then someone on the team needs to debug a production issue at 10 PM. They log into the cloud console, open a security group port to their home IP, fix the issue, and go to bed. They forget to revert the change. The next morning, that security group is still open. Your pipeline never knew about it. Your policies never caught it.
This scenario is not rare. It happens when engineers take shortcuts during incidents, when another team creates resources manually because your pipeline feels too slow, or when auto-scaling launches new instances with default configurations that violate your policies. All these changes happen outside your CI/CD pipeline, so the policy checks you carefully built into your plan and apply steps never see them.
What Drift Detection Actually Means
Drift detection is the process of comparing your actual infrastructure state against what your policies say it should be. It runs periodically - every hour, every day, or whatever cadence makes sense for your team - and reports which resources have strayed from the rules.
The goal is not to prevent changes. Your pipeline policies already handle that by blocking non-compliant deployments before they happen. Drift detection catches changes that already occurred outside your pipeline. It is your safety net for the infrastructure you cannot control through automation alone.
How Drift Detection Works in Practice
The mechanics are straightforward. A tool or script calls your cloud provider's API to list all existing resources. Then it checks each resource against your defined policies one by one.
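For resources managed by Terraform, a scheduled plan is the simplest version of this check. Here is a simple bash script that runs terraform plan and alerts when the plan reports changes. Keep in mind that exit code 2 means the plan contains changes of any kind, so it also fires when configuration has been merged but not yet applied: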
#!/bin/bash
# Scheduled drift detection script (run via cron every hour)
cd /path/to/terraform/project || exit 1

# Stop early if initialization fails; a plan would be meaningless
if ! terraform init -input=false > /dev/null 2>&1; then
    echo "Terraform init failed at $(date)" >> drift_alerts.log
    exit 1
fi

# -detailed-exitcode: 0 = no changes, 1 = error, 2 = changes present
terraform plan -detailed-exitcode -input=false -no-color > plan_output.txt 2>&1
EXIT_CODE=$?

if [ "$EXIT_CODE" -eq 2 ]; then
    echo "Drift detected at $(date)" >> drift_alerts.log
    # Send notification (e.g., Slack webhook)
    curl -s -X POST -H 'Content-type: application/json' \
        --data '{"text":"Drift detected in production infrastructure. Check plan_output.txt for details."}' \
        https://hooks.slack.com/services/YOUR/WEBHOOK/URL
elif [ "$EXIT_CODE" -eq 1 ]; then
    echo "Terraform plan failed at $(date)" >> drift_alerts.log
fi
A policy scan follows the same pattern of listing resources and checking each one. Suppose your policy says no security group should have port 22 open to 0.0.0.0/0. Drift detection scans every security group and flags any violation. Or your policy requires an owner tag on every resource. Drift detection inventories all resources and marks those missing the tag.
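The two example checks can be sketched in shell. This is a minimal sketch under stated assumptions, not a definitive implementation: the function names find_open_ssh and find_missing_owner and the whitespace-separated input formats are hypothetical, and producing that input from your provider's API or CLI is left to you. The sketch ignores protocols, IPv6 ranges such as ::/0, and differences in tag-key casing.

```shell
#!/bin/bash
# Sketch: flag security groups that expose SSH (port 22) to the whole internet.
# ASSUMPTION (hypothetical format): one rule per line on stdin, as
#   <group_id> <from_port> <to_port> <cidr>
find_open_ssh() {
    # A rule violates the policy when its port range covers 22
    # and its source range is 0.0.0.0/0.
    awk '$4 == "0.0.0.0/0" && $2 <= 22 && $3 >= 22 { print $1 }' | sort -u
}

# Sketch: flag resources that have no "owner" tag.
# ASSUMPTION (hypothetical format): one resource per line on stdin, as
#   <resource_id> <comma-separated tag keys, or "-" when untagged>
find_missing_owner() {
    awk '{
        n = split($2, keys, ",")
        found = 0
        for (i = 1; i <= n; i++) if (keys[i] == "owner") found = 1
        if (!found) print $1
    }'
}

# Demo with inline sample data: prints sg-aaa and sg-ccc
printf '%s\n' \
    'sg-aaa 22 22 0.0.0.0/0' \
    'sg-bbb 22 22 10.0.0.0/8' \
    'sg-ccc 0 65535 0.0.0.0/0' \
    'sg-ddd 443 443 0.0.0.0/0' | find_open_ssh
```

Keeping each policy in its own small function makes it easy to report which resource violated which rule, instead of a single pass or fail for the whole scan.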
The results need to go somewhere your team actually looks. A dashboard works. Slack or email notifications work. An automated ticket in your tracking system works. What does not work is generating a report that nobody reads. Someone must own the follow-up, or drift detection becomes noise.
Tools That Help You Detect Drift
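Several tools already include drift detection capabilities. Terraform has terraform plan, which refreshes its view of real infrastructure and shows where it differs from your configuration and state file. Running it with the -refresh-only option limits the report to changes made outside Terraform. But this only helps for resources managed through Terraform. Resources created outside Terraform never appear in the state file, so they need a different approach.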
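One way to surface resources that Terraform does not manage is to compare an inventory of the account against the IDs Terraform tracks. The sketch below assumes you have already exported both lists to plain-text files with one resource ID per line; the function name find_unmanaged and the file layout are hypothetical, and how you export the IDs depends on your tooling.

```shell
#!/bin/bash
# Sketch: list resource IDs that exist in the account but not in Terraform state.
# ASSUMPTIONS (hypothetical inputs):
#   $1  file with one resource ID per line, taken from the cloud inventory
#   $2  file with one resource ID per line, taken from Terraform state
find_unmanaged() {
    # Load the managed IDs first, then print every inventory ID
    # that was not seen. The awk program is given the managed file
    # as its first operand, so ARGV[1] identifies it even when empty.
    awk 'FILENAME == ARGV[1] { managed[$0] = 1; next }
         !($0 in managed)' "$2" "$1" | sort -u
}
```

Anything this prints was created outside Terraform, so a plan will never report it. Those IDs are candidates for import, deletion, or a documented exception.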
Cloud-native tools can scan broadly. AWS Config, Azure Policy, and Google Cloud Asset Inventory inspect resources using their native APIs. They understand the cloud provider's resource model deeply and can evaluate the resource types they support across your account, regardless of how those resources were created.
Open source options exist too. Open Policy Agent (OPA) can run as a policy engine that you trigger periodically against your infrastructure. You write policies in Rego, and OPA evaluates them against resource data you feed in.
The Hard Part: What To Do When You Find Drift
Finding drift is only half the work. The harder question is what happens next.
Good drift reports include enough information to fix the problem. They tell you which resource violated which policy and ideally how to correct it. Some teams go further and implement automatic remediation - when drift is detected, the system automatically reverts the resource to its compliant state.
Automatic remediation sounds great until it bites you. If an engineer opened a port temporarily while debugging a production incident, automatic remediation could close that port while they are still working. The tool would be fixing a policy violation while breaking an active investigation. You need to distinguish between drift that happened by accident and drift that happened intentionally for a short-term reason.
A practical approach is to start with alerting and manual remediation. Let the team know when drift occurs, give them the details, and let them decide whether to fix it immediately or document it as a temporary exception. Once you understand the patterns of drift in your environment, you can consider automation for the cases that are always wrong and never justified.
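Documented temporary exceptions can be applied mechanically before alerts go out. The sketch below is one possible shape, assuming a hypothetical exceptions file where each line holds a resource ID and an expiry date in YYYY-MM-DD form; the function name filter_exceptions and both formats are illustrative, not a standard.

```shell
#!/bin/bash
# Sketch: drop findings covered by a documented, unexpired exception.
# ASSUMPTIONS (hypothetical formats):
#   stdin  one drifted resource ID per line
#   $1     exceptions file, lines of "<resource_id> <expiry YYYY-MM-DD>"
#   $2     optional "today" as YYYY-MM-DD; defaults to the system date
filter_exceptions() {
    awk -v today="${2:-$(date +%Y-%m-%d)}" '
        # Exceptions file: remember entries that have not expired.
        # ISO dates order correctly when compared as plain strings.
        FILENAME == ARGV[1] { if (($2 "") >= today) active[$1] = 1; next }
        # Findings on stdin: print those without an active exception.
        !($1 in active)
    ' "$1" -
}
```

An exception that expires today still counts as active here. Because every exception carries an expiry date, a port opened for debugging reappears in the report once the date passes, so a forgotten change is delayed rather than hidden.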
Three Layers of Protection
Drift detection completes your policy enforcement cycle. Think of it as three layers that cover each other's blind spots.
The diagram below illustrates how these three layers interact:
- Policy check at plan time - Prevents violations before they reach production
- Policy check at apply time - Catches anything that slipped through planning
- Periodic drift detection - Finds changes that happened outside the pipeline
No single layer is enough. Pipeline checks cannot catch manual console changes. Drift detection cannot prevent violations from happening in the first place. Together, they give you coverage that keeps your infrastructure aligned with your policies even when things change outside your automated workflows.
Practical Checklist
- Pick a cadence for drift scans based on how fast your infrastructure changes
- Choose tools that cover both managed and unmanaged resources in your environment
- Send drift reports to a channel your team actually monitors
- Assign ownership for following up on drift findings
- Start with manual remediation before considering automation
- Document temporary exceptions so your team knows which drifts are intentional
What This Means For Your Team
Your policies are only as strong as your ability to enforce them everywhere infrastructure changes. Pipeline checks handle the changes you control. Drift detection handles the changes you do not. Without both, your policy is a document that describes what should be true, not a mechanism that keeps it true. Build drift detection into your operations, and your infrastructure will stay compliant even when the unexpected happens.