Why You Need a Recovery Plan Before Your Next Deployment

You have just pushed a new version of your application to production. Within minutes, users start reporting that they cannot log in. Error rates spike. The database response time triples. Your team chat explodes with messages. Someone asks: "Should we roll back?" Another person says: "Let me try to fix it directly on the server." A third person is silent because they are already running commands they have never tested before.

This scene plays out in teams of all sizes. The common thread is not the bug itself. It is the absence of a plan. When something goes wrong during a deployment, the team does not have time to think clearly. They have to act under pressure, with incomplete information, while users are waiting and managers are asking for updates. In that moment, the difference between a fast recovery and a prolonged outage often comes down to one thing: whether the team already decided what to do before the deployment started.

The Problem With Deciding During an Incident

When a deployment goes wrong, the natural instinct is to figure out what to do on the spot. Someone suggests rolling back to the previous version. Someone else wants to patch the issue directly on the production server. Another person argues that the team should wait and see if the problem stabilizes. These discussions waste precious minutes. Every minute of debate means more users affected, more errors logged, and more pressure on the team.

The bigger risk is that someone takes an unplanned action that makes things worse. Manually editing files on a production server, restoring a partial backup, or running a database command without testing can introduce new problems. What started as a broken login page can turn into data inconsistency, a corrupted database, or a complete service outage.

Teams that have not prepared a recovery plan are essentially gambling. They hope the deployment goes well, and if it does not, they hope someone in the room knows the right thing to do. That is not a strategy. It is wishful thinking.

What a Recovery Plan Actually Is

A recovery plan is not a thick document that sits in a shared drive and gets read once a year. It is a set of decisions made before the deployment, written down in a form that the team can execute under pressure. The plan answers specific questions:

Under what conditions do we stop the deployment and initiate recovery?
Who has the authority to make that call?
Do we roll back to the previous version, or do we roll forward with a fix?
What are the exact steps to execute the chosen recovery action?
How do we verify that the recovery worked?

For a small team, the plan might be a checklist with five steps. For a larger team with multiple services and dependencies, the plan might include coordination points, communication channels, and escalation paths. The complexity scales with the system, but the principle stays the same: decide before you deploy.
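The questions above can be captured as a small data structure that the team fills in before deploying. The following is a minimal sketch, not a prescribed format: the class name `RecoveryPlan`, its fields, the threshold values, and the `should_trigger` helper are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class RecoveryPlan:
    """Decisions agreed on before the deployment (all fields are illustrative)."""
    max_error_rate: float            # trigger: fraction of failed requests
    max_latency_ratio: float         # trigger: current latency / baseline latency
    decider: str                     # who has the authority to call a recovery
    default_action: str              # "rollback" or "roll-forward"
    steps: Tuple[str, ...]           # exact recovery steps, in order
    verification: Tuple[str, ...]    # checks that confirm the recovery worked


def should_trigger(plan: RecoveryPlan, error_rate: float, latency_ratio: float) -> bool:
    """Return True when either agreed threshold is exceeded."""
    return error_rate > plan.max_error_rate or latency_ratio > plan.max_latency_ratio


# Hypothetical values; a real plan would use thresholds the team agreed on.
plan = RecoveryPlan(
    max_error_rate=0.05,
    max_latency_ratio=2.0,
    decider="on-call lead",
    default_action="rollback",
    steps=("redeploy previous version", "restore database snapshot"),
    verification=("login succeeds", "error rate below threshold"),
)

print(should_trigger(plan, error_rate=0.12, latency_ratio=3.0))  # True
print(should_trigger(plan, error_rate=0.01, latency_ratio=1.1))  # False
```

Writing the thresholds down as numbers means nobody has to argue during the incident about whether the situation is "bad enough" to act.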

Why Preparation Matters

There are four reasons why a recovery plan must exist before the deployment, not after the problem appears.

First, time is not on your side during an incident. Every minute of downtime costs something: lost revenue, frustrated users, damaged reputation. If the team has to stop and think about what to do, the recovery time increases. A pre-defined plan removes the thinking step. The team executes known actions instead of inventing new ones.

Second, without a plan, different people will have different opinions about what to do. One engineer might want to roll back immediately. Another might want to investigate first. A third might want to apply a hotfix. These disagreements create delays and confusion. A recovery plan settles these questions in advance. Everyone knows what the default action is and who decides if the team should deviate from it.

Third, some recovery actions require preparation that cannot be done on the spot. Restoring a database to a previous state requires backups taken with the right format and retention policy. Rolling back a mobile application requires the previous version to be signed and ready for distribution. These preparations must be done before the deployment, not after the failure.

Fourth, a plan that has never been tested is just a theory. Teams should simulate failure scenarios and run through the recovery steps in a safe environment. This reveals gaps in the plan, missing permissions, outdated scripts, or assumptions that do not hold in practice. Testing the plan turns it from a document into a capability.
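A recovery drill can be as simple as running each documented step in a safe environment and recording which ones fail. The sketch below assumes each step is wrapped in a Python callable. The function `run_drill` and the two example steps are hypothetical stand-ins, not real tooling.

```python
from typing import Callable, List, Tuple


def run_drill(steps: List[Tuple[str, Callable[[], None]]]) -> List[str]:
    """Run every recovery step in order and return a description of each failure.

    The drill keeps going after a failure so that one run reveals as many
    gaps as possible.
    """
    failures = []
    for name, action in steps:
        try:
            action()
        except Exception as exc:
            failures.append(f"{name}: {exc}")
    return failures


# Hypothetical steps standing in for real staging commands.
def restore_backup() -> None:
    raise PermissionError("staging role cannot read backups")


def redeploy_previous_version() -> None:
    pass  # succeeds in this example


failures = run_drill([
    ("restore backup", restore_backup),
    ("redeploy previous version", redeploy_previous_version),
])
print(failures)  # ['restore backup: staging role cannot read backups']
```

A missing permission like the one simulated here is exactly the kind of gap that is cheap to find in a drill and expensive to find during an outage.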

Recovery Is Not a Sign of Pessimism

Some teams resist creating recovery plans because they feel it signals a lack of confidence in their deployment process. That is the wrong way to think about it. A recovery plan is not an admission that you expect to fail. It is a recognition that complex systems have unpredictable behavior, and being prepared is the responsible thing to do.

Mature teams do not just focus on making deployments successful. They also prepare for the possibility that a deployment will not go as expected. They treat recovery as a normal part of the delivery process, not as an emergency procedure that only gets activated when things go badly.

Two Main Approaches: Rollback and Roll-Forward

Once you accept that a recovery plan is necessary, the next question is what kind of recovery to use. The two most common approaches are rollback and roll-forward.

Rollback means returning the system to the previous known-good state. You undo the deployment and go back to the version that was running before. This is the most straightforward approach when the problem is clear and the previous version is stable.

Roll-forward means deploying a new version that fixes the problem, rather than reverting to the old version. This approach is useful when the previous version has its own issues, when a rollback would cause data loss, or when the fix is small enough to deploy quickly.

Each approach has trade-offs. Rollback is simpler but may not be possible for all types of changes. Roll-forward keeps the system moving forward but requires a fix to be developed and tested under pressure. The right choice depends on the situation, which is exactly why the decision should be discussed and documented before the deployment.

The following flowchart summarizes the decision process and the steps for each recovery path:

flowchart TD
    A[Deployment fails] --> B{Is rollback safe?}
    B -->|Yes| C[Rollback]
    B -->|No| D[Roll-forward]
    C --> E[Revert code]
    E --> F[Revert DB]
    F --> G[Verify system]
    D --> H[Write fix]
    H --> I[Deploy fix]
    I --> J[Verify system]
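The same decision can be sketched in code. This example assumes that "rollback is safe" means the previous version is stable and reverting would not lose data, which is one reasonable reading of the trade-offs described earlier. The names `DeploymentContext`, `is_rollback_safe`, and `choose_recovery_path` are illustrative only.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class DeploymentContext:
    """Facts the team should know before deploying (illustrative fields)."""
    previous_version_stable: bool
    rollback_loses_data: bool


def is_rollback_safe(ctx: DeploymentContext) -> bool:
    # Assumption: rollback is safe only if the old version works
    # and reverting does not discard data written since the deployment.
    return ctx.previous_version_stable and not ctx.rollback_loses_data


def choose_recovery_path(ctx: DeploymentContext) -> Tuple[str, List[str]]:
    """Return the recovery action and its ordered steps, mirroring the flowchart."""
    if is_rollback_safe(ctx):
        return "rollback", ["revert code", "revert DB", "verify system"]
    return "roll-forward", ["write fix", "deploy fix", "verify system"]


action, steps = choose_recovery_path(
    DeploymentContext(previous_version_stable=True, rollback_loses_data=True)
)
print(action)  # roll-forward
print(steps)   # ['write fix', 'deploy fix', 'verify system']
```

Both paths end with the same verification step, which reflects the point that recovery is only complete once the team has confirmed the system works.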

Practical Checklist for Your Next Deployment

Before you deploy, run through this checklist with your team:

Have we agreed on the conditions that would trigger a recovery?
Do we know who decides whether to roll back or roll forward?
Are the exact steps for recovery documented and accessible?
Have we tested the recovery steps in a staging environment?
Do we have the necessary backups, artifacts, and permissions ready?
Does everyone on the team know where to find the plan?

If you cannot answer yes to all of these, your deployment is not ready.
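If your team wants to enforce this rule rather than rely on memory, the checklist can be turned into a simple gate. The sketch below is one possible shape, with the checklist items shortened into labels. The names `CHECKLIST`, `unmet_items`, and `deployment_ready` are hypothetical, and any item left unanswered is treated as a "no".

```python
from typing import Dict, List

# Shortened labels for the six checklist questions above.
CHECKLIST = (
    "trigger conditions agreed",
    "decision maker known",
    "recovery steps documented",
    "recovery tested in staging",
    "backups, artifacts, and permissions ready",
    "team knows where the plan is",
)


def unmet_items(answers: Dict[str, bool]) -> List[str]:
    """Return every checklist item that is missing or answered 'no'."""
    return [item for item in CHECKLIST if not answers.get(item, False)]


def deployment_ready(answers: Dict[str, bool]) -> bool:
    """The deployment is ready only when every item is answered 'yes'."""
    return not unmet_items(answers)


# Example: everything is done except the staging test.
answers = {item: True for item in CHECKLIST}
answers["recovery tested in staging"] = False

print(deployment_ready(answers))  # False
print(unmet_items(answers))       # ['recovery tested in staging']
```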

The Takeaway

A deployment without a recovery plan is not a deployment. It is a hope. The difference between a team that recovers in minutes and a team that spends hours in chaos is not technical skill. It is preparation. Decide what you will do before something goes wrong, write it down, test it, and make sure everyone knows the plan. That is how you turn recovery from a panic into a procedure.