Why Your Infrastructure Changes Need the Same Discipline as Code Changes

Imagine this: someone on your team needs to open a port for a new service. They log into the cloud console, add a firewall rule, and move on. Five minutes later, the entire production application becomes unreachable. The new rule accidentally blocked traffic to the database. No one knows what changed, who changed it, or how to revert it quickly.

This scenario is more common than most teams admit. Infrastructure changes - firewall rules, load balancer configs, storage policies, network settings - don't happen every day. But when they go wrong, they take down everything. A single misconfigured security group can make your application invisible to users. One wrong DNS change can redirect traffic to nowhere.

The problem is that many teams treat infrastructure changes differently from application code changes. Code changes go through pull requests, code review, automated tests, and staged rollouts. Infrastructure changes often happen through direct console access, ad-hoc commands, or shared credentials. The gap in discipline creates a blind spot that eventually causes an outage.

Infrastructure as Code Is the Foundation, Not the Solution

Infrastructure as Code (IaC) means you write your infrastructure configuration in files, store them in a repository, and apply them through automation. Tools like Terraform, Pulumi, or AWS CDK make this possible. But having IaC files in a repo is not enough. The discipline comes from how you manage changes to those files.
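To make this concrete, here is a minimal sketch of what a firewall rule managed as code might look like in Terraform. The resource names, CIDR range, and the referenced `aws_security_group.app` are hypothetical, not a prescription:

```hcl
# Hypothetical example: the firewall change from the opening story, as code.
resource "aws_security_group_rule" "new_service_ingress" {
  type              = "ingress"
  from_port         = 8443
  to_port           = 8443
  protocol          = "tcp"
  cidr_blocks       = ["10.0.0.0/16"]            # internal traffic only (assumption)
  security_group_id = aws_security_group.app.id  # assumes an existing group resource
}
```

Because this rule lives in a file, it has a history, a diff, and an owner, which is exactly what the console click in the opening scenario lacked.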

If someone can push a change to the main branch and have it applied to production without review, you have the same problem as the cloud console scenario. The tool gives you repeatability, but it does not give you safety. Safety comes from process.

A Template for Infrastructure Changes

Every infrastructure change should follow a repeatable sequence. This sequence works regardless of which IaC tool you use. It protects your team from the most common failure patterns.

The following flowchart illustrates the recommended sequence and decision points:

    flowchart TD
        A[Start with Code Change] --> B[Run Plan]
        B --> C{Plan Review}
        C -- Approved --> D[Test in Non-Production]
        C -- Rejected --> A
        D --> E{Tests Pass?}
        E -- Yes --> F[Apply Through Pipeline]
        E -- No --> A
        F --> G[Verify After Apply]
        G --> H{Verification OK?}
        H -- Yes --> I[Change Complete]
        H -- No --> J[Execute Rollback Plan]
        J --> A
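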

Start with a Code Change

Every infrastructure change must begin as a pull request to your infrastructure repository. No exceptions. No one should modify production infrastructure directly through a cloud console, a CLI command on a server, or a manual script.

The pull request shows exactly what changed: a new resource, a modified network configuration, a deleted storage bucket. Team members can review the diff, ask questions, and spot problems before anything runs. Require at least one reviewer who understands the impact of the change. For critical infrastructure like networking or security groups, consider requiring two reviewers.
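One way to enforce the reviewer requirement mechanically, rather than by convention, is a CODEOWNERS file, which platforms like GitHub and GitLab use to request reviews automatically by path. The organization and team handles below are hypothetical:

```
# Hypothetical CODEOWNERS for an infrastructure repository
*.tf                  @example-org/platform-team
modules/network/      @example-org/network-reviewers
modules/security/     @example-org/security-team
```

Paired with branch protection that requires code-owner approval, this makes the two-reviewer rule for critical paths something the platform enforces, not something people have to remember.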

Run a Plan Before You Apply

IaC tools can show you what will change without actually making the change. Terraform calls this a plan. Pulumi calls it a preview. The concept is the same: the tool compares your configuration against the current state and lists every resource that will be created, modified, or destroyed.

Run this plan as part of your pull request process. Save the output and attach it to the PR. Reviewers should check for unexpected changes: Is a resource being deleted that should not be? Is a change being applied to the wrong environment? If the plan shows something surprising, stop and investigate before proceeding.

For example, a Terraform plan for a security group change might look like this:

Terraform will perform the following actions:

  # aws_security_group_rule.app_ingress will be updated in-place
  ~ resource "aws_security_group_rule" "app_ingress" {
        id                     = "sgrule-1234567890"
      ~ from_port              = 8080 -> 8443
        protocol               = "tcp"
      ~ to_port                = 8080 -> 8443
        type                   = "ingress"
        # (5 unchanged attributes hidden)
    }

Plan: 0 to add, 1 to change, 0 to destroy.

This output shows exactly which rule will change and how, allowing reviewers to catch mistakes before the change is applied.
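Generating the plan automatically on every pull request keeps this step from being skipped. The following is a hedged sketch of a GitHub Actions job that produces the plan and uploads it for reviewers; the workflow name, path filter, and cloud credential wiring (omitted here) are assumptions for illustration:

```yaml
name: terraform-plan
on:
  pull_request:
    paths: ["**.tf"]

jobs:
  plan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
      - run: terraform init -input=false
      - run: terraform plan -no-color -input=false -out=tfplan
      - run: terraform show -no-color tfplan > plan.txt
      - uses: actions/upload-artifact@v4
        with:
          name: plan-output
          path: plan.txt
```

Cloud authentication and state backend access would need to be configured for your environment; the point is that the plan output is produced and preserved for every PR without anyone running it by hand.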

Test in a Non-Production Environment First

Apply the change to a staging or development environment before production. If you do not have a full infrastructure staging environment, create a smaller replica that mirrors the critical parts of production. Some teams run a separate infrastructure account or project specifically for testing changes.
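If you are unsure how to structure this, one common repository layout keeps each environment as a thin wrapper around shared modules, so staging exercises the same code paths as production. This layout is one option among several, not a requirement:

```
envs/
  staging/
    main.tf       # calls the same modules as production
    backend.tf    # separate state backend for staging
  production/
    main.tf
    backend.tf
modules/
  network/
  compute/
```

Because both environments consume identical modules, a change tested in envs/staging is structurally the same change that later reaches envs/production.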

If a staging environment is absolutely impossible, at least run the plan against production in read-only mode. This gives you visibility into what would change without actually changing anything.

Apply Through a Pipeline, Not a Laptop

The actual apply command should run in a CI/CD pipeline, not on a developer's machine. The pipeline records who triggered the apply, when it happened, and what changed. This audit trail is essential for debugging and compliance.

The pipeline should stop immediately if the apply fails partway through. Do not let it continue applying changes to other resources after a failure. Partial infrastructure changes are difficult to diagnose and even harder to fix.
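As a sketch, GitHub Actions can gate the apply behind a manual approval by attaching the job to a protected environment. The environment name and the omitted credential wiring are assumptions; the approval rules themselves are configured in the repository settings, not in this file:

```yaml
name: terraform-apply
on:
  push:
    branches: [main]

jobs:
  apply:
    runs-on: ubuntu-latest
    environment: production   # protection rules can require manual approval
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
      - run: terraform init -input=false
      - run: terraform apply -input=false -auto-approve
```

If `terraform apply` exits nonzero, the job fails and no later steps run, which gives you the stop-on-failure behavior described above for free.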

Have a Rollback Plan

Before you apply, know how you will revert the change if something goes wrong. Because every change starts as a pull request, the simplest rollback is often reverting the merge commit and applying the previous configuration through the same pipeline. For immutable infrastructure, rollback means destroying the new resources and recreating the old ones from a previous state. For mutable infrastructure, take a snapshot or backup of the configuration before making changes.

Store your infrastructure state files in a versioned backend. Some teams keep the last known good state file so they can restore it quickly. The rollback plan should be documented and tested, not invented in the middle of an incident.
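For Terraform, a versioned remote backend looks like the following sketch. The bucket and table names are hypothetical; enabling S3 versioning on the bucket is what lets you restore a previous state file quickly:

```hcl
terraform {
  backend "s3" {
    bucket         = "example-infra-state"    # hypothetical; enable S3 versioning on it
    key            = "prod/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "terraform-locks"        # hypothetical; provides state locking
  }
}
```

The locking table also prevents two pipeline runs from applying against the same state at once, which is one of the partial-change failure modes described earlier.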

Verify After Apply

Do not assume everything is fine just because the apply completed without errors. Check that new resources are running. Test that network connections work. Confirm that applications depending on the infrastructure are still healthy.

Automate this verification as much as possible. A simple script that checks resource status, pings endpoints, or runs connectivity tests can catch problems that the apply command does not report.
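A minimal verification step might look like the following sketch in Python. The endpoint names and URLs are assumptions for illustration; real checks would cover whatever the change actually touched:

```python
import urllib.request
import urllib.error

# Hypothetical health endpoints; replace with the services affected by the change.
ENDPOINTS = {
    "app": "https://app.example.com/healthz",
    "api": "https://api.example.com/healthz",
}

def probe(url: str, timeout: float = 5.0) -> bool:
    """Return True if the endpoint answers with a 2xx status within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError):
        return False

def failed_checks(results: dict[str, bool]) -> list[str]:
    """Names of checks that did not pass, sorted for stable output."""
    return sorted(name for name, ok in results.items() if not ok)

def verify(endpoints: dict[str, str]) -> list[str]:
    """Probe every endpoint and return the names of the ones that failed."""
    return failed_checks({name: probe(url) for name, url in endpoints.items()})
```

Run something like this as the last pipeline step and fail the build (for example, `sys.exit(1)`) whenever `verify(ENDPOINTS)` returns any names, so a green apply with a broken service still shows up as a failed change.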

A Practical Checklist for Infrastructure Changes

  • Change starts as a pull request in the infrastructure repository
  • At least one reviewer who understands the impact has approved the PR
  • Plan output has been reviewed and matches expectations
  • Change has been applied to a non-production environment first
  • Apply runs through a pipeline, not from a local machine
  • Rollback plan is documented and ready
  • Verification checks pass after apply

This checklist is not bureaucracy. It is protection against the kind of outage that starts with a single click in a cloud console and ends with a team scrambling to understand what happened.

The Takeaway

Infrastructure changes are high-risk, low-frequency operations. That combination makes them dangerous. When you do something rarely, you are more likely to make mistakes. When the impact is broad, those mistakes hurt more.

Treat infrastructure changes with the same rigor as application code changes. Pull requests, reviews, plans, staged environments, pipeline execution, and verification are not optional extras. They are the minimum process for keeping your infrastructure stable. The tool you use matters less than the discipline you apply.