30-1 · Chapter 30 · 5 min read

Why Rolling Back Infrastructure Is Nothing Like Rolling Back an Application

You push a bad application update. Users start seeing errors. Your team swaps the load balancer back to the previous version, or the pipeline redeploys the old artifact. Within minutes, the app is running the old code. Data is intact. The database didn't change. The servers are the same machines they were before. The problem is gone.

That's application rollback. It works because applications are mostly stateless. You swap the code, and the old behavior comes back. No lasting side effects.

Now imagine you changed a firewall rule, resized a database disk, or updated a subnet configuration. Something breaks. You want to go back. Can you just reapply the old configuration and expect everything to work?

Probably not.

Infrastructure rollback is a different category of problem. The resources involved hold data, manage network paths, and provide foundational services that other systems depend on. When you change infrastructure, you are not just swapping code. You are altering the state of something that may have years of data, connections to dozens of services, or a role as the foundation for everything above it.

State Is the First Problem

Applications are designed to be disposable. You can kill a container, spin up a new one with old code, and the application starts fresh. No memory of the previous version. No leftover data from the failed deployment.

Infrastructure is the opposite. A database keeps its data even if you change its configuration. A disk volume keeps its files even if you resize it. These resources are stateful. They hold state that persists beyond the lifecycle of any configuration change.

When you roll back an application, you just restore the code. When you roll back infrastructure, you must restore the configuration without destroying the data that has accumulated inside the resource. That is not always possible.

Here is a concrete example. You move a database from a small instance to a large one because traffic increased, and the larger instance comes with more storage. The new instance has a problem, so you want to go back to the small one. But while the large instance was running, more data was written to the database, and that data no longer fits in what the small instance can hold. Rolling back is not safe: you cannot return to the small instance without losing data.

The difference becomes clear when you compare the two paths side by side.

flowchart TD
    A[Bad change deployed] --> B{Is it an app?}
    B -->|Yes| C[Swap code via LB or pipeline]
    C --> D[Old code runs, no side effects]
    D --> E[Rollback successful]
    B -->|No, it's infra| F[Identify stateful resources]
    F --> G{Can state be preserved?}
    G -->|No| H[Rollback may destroy data]
    G -->|Yes| I[Check dependencies]
    I --> J[Ordered rollback sequence]
    J --> K[Partial or broken state]
    K --> L[Recovery needed, not just rollback]

This is not a tool problem. This is a fundamental constraint of stateful resources. The act of running the new configuration changes the resource in ways that the old configuration cannot accommodate.
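The database example can be reduced to a single check: does the data that exists now still fit in the old configuration? The sketch below is a minimal illustration, not a real provider API. The `DatabaseConfig` type, the `storage_gb` field, and the 20% headroom default are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class DatabaseConfig:
    name: str
    storage_gb: int  # hypothetical: storage that comes with this configuration


def can_revert(old: DatabaseConfig, data_written_gb: float, headroom: float = 0.2) -> bool:
    """Return True if the data stored today still fits in the old configuration.

    headroom is the fraction of storage kept free (an assumed safety margin).
    """
    usable_gb = old.storage_gb * (1 - headroom)
    return data_written_gb <= usable_gb


small = DatabaseConfig("small", storage_gb=100)

# Right after the upgrade, little has changed and a revert is still possible.
print(can_revert(small, data_written_gb=60))   # True

# After the large instance has absorbed more writes, the old config no longer fits.
print(can_revert(small, data_written_gb=140))  # False
```

The answer depends on `data_written_gb`, which lives in the resource and not in the configuration. That is why the same rollback can be safe on Monday and destructive on Friday.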

Dependencies Multiply the Risk

Infrastructure resources rarely exist alone. One change can touch ten interconnected resources: a VPC, a subnet, a security group, a load balancer, several instances, and a database. Each resource depends on the others in specific ways.

When you roll back one resource, the resources that depend on it are affected. Restoring an old security group can break communication between the load balancer and the instances. Restoring a subnet can cut the database connection. Rolling back is not a single operation. It is a sequence that must be ordered carefully, and the order depends on how the resources were originally created and how they have changed since.

In practice, this means you cannot just check out the old configuration, run terraform apply, and walk away. The old configuration may conflict with the current state of other resources that were not rolled back. The result is often a partial recovery that leaves your infrastructure in a broken state.
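One way to reason about ordering is to write the dependencies down as a graph and let a topological sort produce a sequence. The sketch below uses Python's standard `graphlib` module (Python 3.9+). The resource names and the `depends_on` map are hypothetical, and a real tool derives this graph from your configuration.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: each resource lists the resources it depends on.
depends_on = {
    "vpc": set(),
    "subnet": {"vpc"},
    "security_group": {"vpc"},
    "instances": {"subnet", "security_group"},
    "database": {"subnet", "security_group"},
    "load_balancer": {"subnet", "security_group", "instances"},
}


def creation_order(graph):
    """Order in which dependencies come before the resources that need them."""
    return list(TopologicalSorter(graph).static_order())


def dependents_of(graph, target):
    """Every resource that directly or transitively depends on target."""
    affected = set()
    stack = [target]
    while stack:
        current = stack.pop()
        for name, deps in graph.items():
            if current in deps and name not in affected:
                affected.add(name)
                stack.append(name)
    return affected


# Resources to re-check if the security group is reverted.
print(sorted(dependents_of(depends_on, "security_group")))
# ['database', 'instances', 'load_balancer']

# One valid order for restoring resources, foundation first.
print(creation_order(depends_on))
```

Whether you walk the order forward (restore the foundation first) or in reverse (remove dependents first) depends on the change. The graph only tells you which sequences are consistent, not whether the resources at each step still hold compatible state.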

Idempotent Apply Does Not Mean Safe Rollback

Infrastructure pipelines are designed to be idempotent. You can run the same configuration multiple times and get the same result. That works well for applying changes. But idempotent apply does not mean safe rollback.

Consider disk size. You declare a disk of 100 GB, apply it, and the disk is created. You run the same configuration again, and nothing changes. That is idempotent. Now you change the configuration to 200 GB and apply it. The disk grows. Then you change the configuration back to 100 GB and apply it again. What happens?

Most infrastructure tools will either reject the operation or destroy the disk and create a new one. They cannot shrink a disk without risking data loss. The configuration is idempotent in theory, but the actual resource has changed in a way that cannot be reversed.

This is closely related to what is usually called configuration drift. The configuration in your code says one thing, but the actual resource in the cloud or on the server is different. When you try to roll back, you are not just reverting code. You are trying to reconcile a configuration that no longer matches reality. And reality often wins.
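The disk example can be modeled as a tiny planner that compares the actual size with the declared size. This is a toy model under stated assumptions, not the behavior of any specific tool: the `Action` names and `plan_disk_change` function are invented for illustration, and some tools reject a shrink outright instead of replacing the disk.

```python
from enum import Enum


class Action(Enum):
    NO_OP = "no-op"
    GROW_IN_PLACE = "grow in place"
    REPLACE = "destroy and recreate"  # assumed outcome for a shrink; data is lost


def plan_disk_change(actual_gb: int, desired_gb: int) -> Action:
    """Decide what applying desired_gb would do to a disk that is actual_gb today."""
    if desired_gb == actual_gb:
        return Action.NO_OP
    if desired_gb > actual_gb:
        return Action.GROW_IN_PLACE
    return Action.REPLACE


actual = 100
for desired in (100, 200, 100):
    action = plan_disk_change(actual, desired)
    print(f"{actual} GB -> {desired} GB: {action.value}")
    if action is not Action.REPLACE:
        actual = desired  # only non-destructive actions are applied here

# 100 GB -> 100 GB: no-op
# 100 GB -> 200 GB: grow in place
# 200 GB -> 100 GB: destroy and recreate
```

The first and last iterations apply the same 100 GB configuration and get different results. Re-applying is idempotent only relative to the resource's current state, which is the gap between idempotent apply and safe rollback.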

What This Means for Your Team

Infrastructure rollback requires a different kind of preparation than application rollback. You cannot rely on the same pipeline or the same mental model. You need to know which resources are safe to roll back, which are not, and what order to follow.

Some changes are reversible. Changing a load balancer health check configuration is usually safe to roll back. Changing a database parameter group might be safe if the new parameters did not alter stored data. But changing instance size, disk size, network topology, or storage engine is often irreversible without data loss or downtime.

The safest approach is to plan for recovery, not just rollback. Recovery means accepting that the old configuration may no longer be valid and building a path forward instead of backward. That might involve creating a new resource with the old configuration and migrating data, or accepting a degraded state while a fix is developed.

Practical Checklist for Infrastructure Changes

Before you apply any infrastructure change, ask these questions:

Does this resource hold state? If yes, can the state be preserved if we revert the configuration?
What other resources depend on this one? Will rolling back break their connectivity?
Is the change reversible? Can the tool shrink a disk, downgrade an instance, or restore a network path without destroying data?
What is the actual recovery plan if the change fails? Is it a rollback, a migration, or a rebuild?
Have we tested the recovery path in a non-production environment?
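The questions above can also be captured as a small pre-change review record that a pipeline or a reviewer fills in. The sketch below is one possible shape, not an established format. The `ChangeReview` fields and the `open_questions` function are hypothetical names chosen to mirror the checklist.

```python
from dataclasses import dataclass, field

VALID_PLANS = {"rollback", "migration", "rebuild"}


@dataclass
class ChangeReview:
    description: str
    holds_state: bool
    state_survives_revert: bool
    reversible_in_place: bool
    dependents: list = field(default_factory=list)
    connectivity_verified: bool = False
    recovery_plan: str = ""  # expected: "rollback", "migration", or "rebuild"
    recovery_tested: bool = False


def open_questions(review: ChangeReview) -> list:
    """Return the checklist items that are still unanswered or contradictory."""
    issues = []
    revert_unsafe = (
        (review.holds_state and not review.state_survives_revert)
        or not review.reversible_in_place
    )
    if review.recovery_plan not in VALID_PLANS:
        issues.append("no recovery plan chosen")
    elif revert_unsafe and review.recovery_plan == "rollback":
        issues.append("plan is 'rollback' but a revert would not preserve the resource")
    if review.dependents and not review.connectivity_verified:
        issues.append("dependents not checked: " + ", ".join(review.dependents))
    if not review.recovery_tested:
        issues.append("recovery path not tested outside production")
    return issues


resize = ChangeReview(
    description="Grow data disk from 100 GB to 200 GB",
    holds_state=True,
    state_survives_revert=False,
    reversible_in_place=False,
    dependents=["database"],
    recovery_plan="rollback",
)

for issue in open_questions(resize):
    print("BLOCKED:", issue)
```

A change with an empty list is not guaranteed to be safe. It only means the team has answered each question explicitly before applying.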

Takeaway

Application rollback is a code swap. Infrastructure rollback is a state reconciliation problem. Treating them the same way leads to broken systems, lost data, and long recovery times. Plan for recovery, not just rollback. Know which changes are reversible, and test your recovery path before you need it.