Rollback: When Going Back Is Not as Simple as It Sounds
You just deployed a new version of your application. Five minutes later, errors start appearing in the monitoring dashboard. Users are reporting problems. Your first instinct is obvious: put the old version back. That is rollback in its simplest form -- revert to the last known stable state and buy yourself time to figure out what went wrong.
For many teams, rollback is the default recovery strategy. It makes intuitive sense. If the new version broke things, the old version worked fine. Just swap them back. But the reality is more complicated. Rollback works differently depending on what you are reverting: application code, database schema, or infrastructure configuration. Each has its own mechanics, risks, and limitations.
Application Rollback: The Relatively Easy Case
Rolling back application code is the most straightforward scenario. You have a running instance of your application, and you replace the new version with the old one. How you do this depends on your deployment strategy.
If you use blue-green deployment, rollback means switching traffic back to the environment that still runs the old version. Under the usual naming, blue is the environment running the old version and green is the one running the new release, so rollback makes blue active again and takes green out of rotation. This switch can happen in seconds because both environments are already running and ready.
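To make the traffic switch concrete, here is a minimal sketch in Python. The BlueGreenRouter class, the version strings, and the choice of which color holds which version are all hypothetical; a real setup would flip a load balancer target or DNS record rather than a Python attribute.

```python
class BlueGreenRouter:
    """Toy stand-in for a load balancer that points at one of two environments."""

    def __init__(self, environments, live):
        if live not in environments:
            raise ValueError(f"unknown environment: {live}")
        self.environments = environments  # environment name -> deployed version
        self.live = live                  # environment currently receiving traffic
        self.previous = None              # environment that was live before the last switch

    def switch_to(self, name):
        """Point all traffic at `name` and remember where it came from."""
        if name not in self.environments:
            raise ValueError(f"unknown environment: {name}")
        self.previous, self.live = self.live, name

    def rollback(self):
        """Send traffic back to the environment that was live before the last switch."""
        if self.previous is None:
            raise RuntimeError("no previous environment to roll back to")
        self.switch_to(self.previous)

    def handle(self, request):
        return f"{request} served by {self.environments[self.live]}"


# Hypothetical labeling: blue runs the old version, green runs the new one.
router = BlueGreenRouter({"blue": "v1.4.0", "green": "v1.5.0"}, live="blue")
router.switch_to("green")  # release the new version
router.rollback()          # errors appear, so traffic goes back to blue
```

Nothing is redeployed during the rollback; only the pointer moves, which is why this style of rollback can be so fast.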
If you use canary releases, rollback means stopping the flow of traffic to the new version and routing everything back to the previous version. The canary is killed, and the stable version handles all requests again.
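As a rough illustration, canary routing can be thought of as a percentage knob, and rollback as turning that knob to zero. The route function and the modulo-based bucketing below are assumptions made for this sketch; real traffic splitting is usually done by a service mesh, ingress controller, or load balancer.

```python
def route(request_id: int, canary_percent: int) -> str:
    """Send about canary_percent% of requests to the canary, the rest to stable."""
    if not 0 <= canary_percent <= 100:
        raise ValueError("canary_percent must be between 0 and 100")
    # Bucket requests into 100 slots; the lowest slots go to the canary.
    return "canary" if request_id % 100 < canary_percent else "stable"


# During the release, 10% of requests reach the new version.
during_release = [route(i, canary_percent=10) for i in range(100)]

# Rollback: set the canary share to zero so the stable version handles everything.
after_rollback = [route(i, canary_percent=0) for i in range(100)]
```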
If you use straightforward rolling updates, rollback means redeploying the previous artifact to the same servers or containers. This takes longer because each instance needs to be updated one by one, but the process is still predictable.
For example, if you are using Kubernetes, a single command can revert a deployment to its previous revision:
kubectl rollout undo deployment/my-app -n production
This command tells Kubernetes to scale down the new pods and scale up the old ones, effectively reversing the rolling update. You can also specify a particular revision if you need to go back further than one step:
kubectl rollout undo deployment/my-app -n production --to-revision=3
To see the history of revisions before rolling back:
kubectl rollout history deployment/my-app -n production
The key advantage of application rollback is that the rollback itself does not change data. You are only changing which code handles incoming requests. As long as the release did not also change the schema or rewrite existing data, the database remains untouched, and no user data is lost or transformed. This makes application rollback relatively safe and fast.
The following flowchart maps the decision paths for each rollback type discussed in this section.
Database Rollback: Where Things Get Messy
Database rollback is a different beast entirely. Databases store state that changes continuously. When a new application version modifies the database schema -- adding a column, renaming a table, changing a data type -- rolling back the application code alone is not enough. You must also revert the database structure to its previous state.
This is where the complexity multiplies. Consider a simple scenario: your new version adds a column called phone_number to the users table. The application starts writing phone numbers into that column. After an hour, you discover a critical bug and decide to roll back. You deploy the old application code, but the old code does not know about the phone_number column. More importantly, the data already written into that column needs to be handled. Do you delete it? Do you move it somewhere else? Do you leave it and hope the old code ignores it?
The safest approach is to make every database migration reversible from the start. This means each migration script includes both an up step that applies the change and a down step that reverts it. When you roll back, you run the down migration to restore the previous schema.
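Here is a minimal sketch of an up/down pair for the phone_number scenario, using Python's built-in sqlite3 module and an in-memory database. The users table layout, the users_phone_archive table, and the decision to archive phone numbers before dropping the column are assumptions made for illustration; your migration tool and database will have their own conventions.

```python
import sqlite3


def up(conn):
    """Apply the change: add the new column."""
    conn.execute("ALTER TABLE users ADD COLUMN phone_number TEXT")


def down(conn):
    """Revert the change without silently discarding what users entered."""
    # Assumption: copy phone numbers into a side table before removing the column.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS users_phone_archive ("
        "user_id INTEGER PRIMARY KEY, phone_number TEXT)"
    )
    conn.execute(
        "INSERT OR REPLACE INTO users_phone_archive (user_id, phone_number) "
        "SELECT id, phone_number FROM users WHERE phone_number IS NOT NULL"
    )
    # Rebuild the table in its old shape; this also works on SQLite versions
    # that lack ALTER TABLE ... DROP COLUMN.
    conn.execute("CREATE TABLE users_old (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
    conn.execute("INSERT INTO users_old (id, name) SELECT id, name FROM users")
    conn.execute("DROP TABLE users")
    conn.execute("ALTER TABLE users_old RENAME TO users")


def columns(conn, table):
    """Return the column names of a table, in order."""
    return [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute("INSERT INTO users (name) VALUES ('Ada')")

up(conn)  # the new version ships
conn.execute("UPDATE users SET phone_number = '555-0100' WHERE name = 'Ada'")

down(conn)  # roll back: old schema restored, phone numbers archived
conn.commit()
```

Even in this tiny case, the down step has to decide what happens to data the old schema cannot hold. Archiving is one option; it is not free, because someone still has to decide what to do with the archive later.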
But not all changes are truly reversible without data loss. If your new version deleted a column that was still in use, rolling back means recreating that column and restoring its data from a backup. If your new version merged two tables into one, rolling back means splitting them apart and figuring out which rows belonged to which table. These operations are risky, time-consuming, and often require manual intervention.
Many teams accept this reality and choose to avoid database rollback altogether. Instead, they invest heavily in testing migrations before deployment, running them against staging environments that mirror production data as closely as possible. When something does go wrong, they prefer to write a forward fix rather than attempt a backward revert.
Infrastructure Rollback: The Hidden Dependency Web
Infrastructure rollback means reverting changes to servers, networking rules, load balancers, or supporting services. If you manage infrastructure with tools like Terraform, Ansible, or Pulumi, rollback typically involves applying a previous version of your configuration files.
The challenge here is that infrastructure changes rarely affect just one thing. A change to a firewall rule might break database connectivity. A change to a load balancer configuration might affect traffic routing for multiple services. Applying an older Terraform configuration might destroy resources that the new version created, which could cascade into other problems.
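One way to reason about this before touching anything is to compare what exists now with what the older configuration describes. The sketch below is a toy model, not how Terraform, Ansible, or Pulumi work internally; the plan_rollback function and every resource name in it are hypothetical. In practice you would rely on your tool's own preview step, such as terraform plan.

```python
def plan_rollback(current: dict, target: dict) -> dict:
    """Compare live resources with those an older configuration describes."""
    shared = current.keys() & target.keys()
    return {
        # Present now, absent from the old config: rollback would remove these.
        "destroy": sorted(current.keys() - target.keys()),
        # Described by the old config but missing now: rollback would recreate these.
        "create": sorted(target.keys() - current.keys()),
        # Present in both with different settings: rollback would modify these.
        "change": sorted(name for name in shared if current[name] != target[name]),
    }


# Hypothetical resources as they exist after the new release.
current = {
    "firewall.db_ingress": {"port": 5432, "source": "10.0.2.0/24"},
    "lb.public": {"backends": ["app-v2"]},
    "cache.sessions": {"size": "small"},
}

# Hypothetical resources described by the previous configuration.
previous = {
    "firewall.db_ingress": {"port": 5432, "source": "10.0.1.0/24"},
    "lb.public": {"backends": ["app-v1"]},
}

plan = plan_rollback(current, previous)
```

In this example the rollback would destroy cache.sessions and change both the firewall rule and the load balancer. Whether anything still depends on that cache is exactly the kind of question the diff cannot answer for you.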
Infrastructure rollback also takes time. Applying a previous configuration requires running the same provisioning processes that created the infrastructure in the first place. If your infrastructure is large or complex, this could take minutes or even hours -- time that your users are experiencing errors.
The Limits of Rollback as a Strategy
Rollback is not a universal safety net. It only works when three conditions are met:
First, the old version must still be stable and compatible with the current system state. If your new version ran for hours and users entered data that the old version cannot read, rolling back will cause data loss or corruption.
Second, the problem must be in the code or configuration, not in the data. If the issue is that users are misusing a feature or that data quality has degraded, rolling back the code will not fix anything.
Third, the rollback must be faster than the time it takes to fix forward. If rolling back takes thirty minutes but writing a hotfix takes ten, rollback is the slower option.
There is also a behavioral risk. Teams that rely too heavily on rollback can become careless about pre-deployment testing. The mindset becomes "if it breaks, we will just roll back." This is dangerous because rollback has real costs. Users who saw errors or lost data do not care that you recovered in five minutes. Their trust has already been damaged.
Practical Checklist Before Deciding to Roll Back
Before you execute a rollback, ask these questions:
- Is the old version still running and ready to accept traffic?
- Has the database schema changed in a way that makes the old code incompatible?
- How long has the new version been live, and how much user data has been entered?
- Can the database migration be reversed without data loss?
- Is the rollback faster than writing a forward fix?
- Have you communicated the rollback plan to the team and stakeholders?
If the answer to any of these raises a red flag, consider roll-forward instead.
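The checklist above can be sketched as a small helper that collects red flags. The RollbackCheck fields, the red_flags and recommend functions, and the 60-minute threshold for "live long enough to worry about new data" are all assumptions for illustration; tune them to your own system rather than treating them as rules.

```python
from dataclasses import dataclass


@dataclass
class RollbackCheck:
    old_version_ready: bool       # old version running and able to take traffic
    schema_breaks_old_code: bool  # schema changed in a way the old code cannot handle
    minutes_live: float           # how long the new version has been serving users
    migration_reversible: bool    # migration can be reversed without data loss
    rollback_minutes: float       # estimated time to roll back
    fix_forward_minutes: float    # estimated time to ship a forward fix
    team_informed: bool           # plan communicated to team and stakeholders


def red_flags(check: RollbackCheck, max_minutes_live: float = 60) -> list:
    """Return a human-readable reason for each checklist item that looks risky."""
    flags = []
    if not check.old_version_ready:
        flags.append("old version is not ready to accept traffic")
    if check.schema_breaks_old_code:
        flags.append("schema change makes the old code incompatible")
    if check.minutes_live > max_minutes_live:  # assumed threshold, not a rule
        flags.append("new version has been live long enough to collect significant data")
    if not check.migration_reversible:
        flags.append("migration cannot be reversed without data loss")
    if check.rollback_minutes >= check.fix_forward_minutes:
        flags.append(
            f"rollback ({check.rollback_minutes:g} min) is not faster than "
            f"a forward fix ({check.fix_forward_minutes:g} min)"
        )
    if not check.team_informed:
        flags.append("rollback plan has not been communicated")
    return flags


def recommend(check: RollbackCheck) -> str:
    return "roll back" if not red_flags(check) else "consider fixing forward"


# The thirty-minute rollback versus ten-minute hotfix case from earlier.
check = RollbackCheck(
    old_version_ready=True,
    schema_breaks_old_code=False,
    minutes_live=5,
    migration_reversible=True,
    rollback_minutes=30,
    fix_forward_minutes=10,
    team_informed=True,
)
flags = red_flags(check)
decision = recommend(check)
```

A helper like this does not make the decision for you, but writing the questions down as fields forces you to have an answer for each one before the incident starts.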
The Takeaway
Rollback is a legitimate recovery strategy, but it is not a free undo button. Application rollback is relatively safe and fast. Database rollback is risky and often irreversible. Infrastructure rollback is slow and can have cascading effects. Before you make rollback your default response, understand what you are reverting and what the real cost will be. Sometimes the better move is to fix forward -- and that is what we will look at next.