What Happens After Rollback: Verifying Your Recovery Actually Worked
You just hit the rollback button. The deployment that caused errors, slow responses, or database corruption is gone. Your application is back on the previous version. Your team breathes a sigh of relief.
But is the system actually working?
Rolling back or rolling forward is not the end of an incident. It is the middle. The real question is whether the recovery itself introduced new problems, left behind residual data, or broke integrations with other services. Without proper verification after recovery, you might declare the incident resolved while the system is still broken in ways that only surface hours later.
The Hidden Risks of Recovery
When you roll back an application, you are not simply undoing time. The old version returns, but the environment may have changed. Database schemas might not match. Configuration files could be from the newer version. Data written by the problematic version remains in the database. Services that depended on the new API format now receive old responses and break.
Roll-forward with a hotfix has its own risks. You fixed one bug, but the hotfix might have changed something else. A configuration value that worked in the emergency fix might not be correct for normal operation. The hotfix might have been written quickly without the usual testing, and it could contain its own defects.
This is why verification after recovery is not optional. It is the safety net that catches the problems recovery itself creates.
Start With Smoke Tests
The fastest way to check if the system is alive is a smoke test. This is a short set of checks that confirm the core functions of your application work. Not every feature, not every edge case, just the critical paths that users depend on.
A good smoke test for a web application might include:
- Can users log in?
- Can they see their main dashboard?
- Can they submit a form or complete a transaction?
- Are there any errors in the application logs?
These checks should be automated. If your smoke test requires someone to manually click through screens, it will be skipped when the team is tired and under pressure after an incident. Automate it so that within minutes of completing the recovery, the smoke test runs and gives a clear pass or fail.
For example, a simple automated smoke test using curl might look like this:
#!/bin/bash
# Simple smoke test after rollback
BASE_URL="https://my-app.example.com"

# Check health endpoint
if ! curl -f -s -o /dev/null "$BASE_URL/health"; then
  echo "FAIL: Health endpoint unreachable"
  exit 1
fi

# Check login page loads
if ! curl -f -s -o /dev/null "$BASE_URL/login"; then
  echo "FAIL: Login page not loading"
  exit 1
fi

# Check a critical API endpoint responds successfully
if ! curl -f -s -o /dev/null "$BASE_URL/api/v1/status"; then
  echo "FAIL: API status endpoint failed"
  exit 1
fi

echo "PASS: All smoke tests passed"
Compare Metrics Before and After
Smoke tests tell you the application is running. Metrics tell you if it is running well.
Compare key metrics from before the incident, during the incident, and after recovery. The metrics that matter depend on your application, but common ones include:
- Requests per second
- Error rate (percentage of failed requests)
- Average response time
- Memory and CPU usage
- Database query latency
If the error rate after recovery is still higher than before the incident, something is wrong. Maybe a service did not restart properly. Maybe a configuration did not roll back. Maybe the recovery process itself introduced a new bottleneck.
Do not rely on a single metric. A low error rate combined with high latency can still mean the system is unhealthy. Look at the whole picture.
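If your metrics live in Prometheus, this comparison can be scripted rather than eyeballed. The sketch below is assumption-heavy: the server URL, the http_requests_total metric, and the baseline timestamp are placeholders to adapt to your own stack:

#!/bin/bash
# Compare the current error rate against a pre-incident baseline.
# PROM_URL, the metric name, and BASELINE_TS are all placeholders.
PROM_URL="http://prometheus.example.com:9090"
QUERY='sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))'

# Error rate right now
NOW=$(curl -s -G "$PROM_URL/api/v1/query" --data-urlencode "query=$QUERY" \
  | jq -r '.data.result[0].value[1] // "0"')

# Error rate at a point before the incident started
BASELINE_TS="2024-05-01T09:00:00Z"
BEFORE=$(curl -s -G "$PROM_URL/api/v1/query" \
  --data-urlencode "query=$QUERY" --data-urlencode "time=$BASELINE_TS" \
  | jq -r '.data.result[0].value[1] // "0"')

echo "error rate before incident: $BEFORE, after recovery: $NOW"
# Flag the recovery as suspect if the error rate is still well above baseline
awk -v a="$NOW" -v b="$BEFORE" 'BEGIN { exit (a > b * 1.5) ? 1 : 0 }' \
  || echo "WARN: error rate still above pre-incident baseline"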
Database Verification Requires Extra Care
Database recovery is the trickiest part. If you rolled back the database to a backup taken before the deployment, any data written between that backup and the rollback is gone. That data might include user transactions, orders, or configuration changes.
You need to check:
- Is the data that was lost acceptable from a business perspective?
- Can you recover critical data from logs or other systems?
- Are there any orphaned records left by the newer version that the old version cannot handle?
Sometimes the answer is that some data loss is acceptable. Other times you need to manually restore specific records. Either way, you need to know what was lost and decide whether it needs action.
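What these checks look like depends entirely on your schema. As a sketch, assuming a PostgreSQL database with hypothetical orders and order_items tables, you might bound the window of lost writes and look for child rows whose parent disappeared in the restore:

#!/bin/bash
# Hypothetical integrity checks after a database rollback.
# DB_URL and all table and column names are placeholders for your own schema.
DB_URL="postgres://user:pass@db.example.com/app"

# Orphaned child rows left behind by the newer version
psql "$DB_URL" -t -c "
  SELECT count(*)
  FROM order_items oi
  LEFT JOIN orders o ON o.id = oi.order_id
  WHERE o.id IS NULL;"

# Timestamp of the newest surviving row, to bound the window of lost writes
psql "$DB_URL" -t -c "SELECT max(created_at) FROM orders;"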
Check Integrations With Other Systems
Your application does not live in isolation. After recovery, the old version might not be compatible with APIs that other teams changed while your deployment was in progress. Or the hotfix might have altered the format of data sent to monitoring, logging, or analytics systems.
Test the connections:
- Can your application still call external APIs?
- Do those APIs return responses your version can parse?
- Are downstream services receiving the data they expect?
Integration issues after recovery are common because teams focus on their own application and forget about the services it depends on and the services that depend on it.
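These connection tests can be scripted the same way as the smoke test. A minimal sketch, assuming a hypothetical partner endpoint and a JSON status field your version expects, verifies not just that the call succeeds but that the response still parses:

#!/bin/bash
# Verify an outbound integration still works after recovery.
# The partner URL and the expected "status" field are assumptions.
PARTNER_URL="https://api.partner.example.com/v2/ping"

RESPONSE=$(curl -f -s "$PARTNER_URL") || {
  echo "FAIL: partner API unreachable"
  exit 1
}

# Confirm the response is JSON in the shape the rolled-back version can parse
echo "$RESPONSE" | jq -e '.status' > /dev/null || {
  echo "FAIL: partner API response missing expected field"
  exit 1
}
echo "PASS: partner integration healthy"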
Verify From the User's Perspective
Dashboards and logs show what the system is doing. They do not always show what the user is experiencing. A user might see a blank page that does not generate an error log. A transaction might fail silently because the frontend does not display the error message.
If possible, have internal users test the main features after recovery. Or use real user monitoring tools to check metrics like transaction success rate and session duration. These metrics often reveal problems that infrastructure monitoring misses.
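A cheap approximation is a synthetic user journey that checks page content rather than status codes. The sketch below assumes a form-based login, a dedicated test account, and a marker string on the dashboard; all three are placeholders:

#!/bin/bash
# Synthetic user journey: log in, load the dashboard, check the content.
# Credentials, paths, and the expected marker text are all placeholders.
BASE_URL="https://my-app.example.com"
COOKIES=$(mktemp)
trap 'rm -f "$COOKIES"' EXIT

# Log in as a dedicated test user and keep the session cookie
curl -f -s -c "$COOKIES" -d "user=smoketest&pass=secret" \
  "$BASE_URL/login" -o /dev/null || { echo "FAIL: login"; exit 1; }

# A 200 is not enough: confirm the dashboard actually rendered content
curl -f -s -b "$COOKIES" "$BASE_URL/dashboard" \
  | grep -q "Recent activity" \
  || { echo "FAIL: dashboard loaded but content missing"; exit 1; }

echo "PASS: synthetic user journey succeeded"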
Document What You Found
After verification is complete, write down what happened. This documentation serves two purposes:
- It provides evidence that the recovery was successful and the system is back to normal.
- It helps your team improve the deployment and recovery process for next time.
Include what went wrong, what the recovery involved, what you checked during verification, and any issues you discovered. This record will help other team members who face similar problems in the future.
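The format matters less than the habit. A minimal plain-text template might look like this:

Incident: <id / date>
Trigger: what the deployment broke
Recovery: rollback or roll-forward, and which version is now live
Verification performed: smoke tests, metric comparison, DB checks, integrations
Data loss: what was lost, and the decision taken on it
Open issues: anything still being watched
Follow-ups: process changes for the next deployment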
A Practical Verification Checklist
Here is a short checklist to use after any recovery:
- Smoke test passed for all critical functions
- Error rate returned to pre-incident levels
- Response time is normal
- Resource usage (CPU, memory, disk) is stable
- Database integrity checked, data loss documented
- All external integrations are working
- User-facing metrics show normal behavior
- Findings documented for future reference
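If the checks above live as separate scripts, a thin wrapper can run the whole checklist in order and stop at the first failure; the script names here are assumptions standing in for your own:

#!/bin/bash
# Run the full post-recovery verification suite in order.
# Each script name is a placeholder for the checks sketched above.
set -e
./smoke_test.sh
./compare_metrics.sh
./check_database.sh
./check_integrations.sh
./synthetic_user_journey.sh
echo "All post-recovery checks passed, incident can be declared resolved"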
The Real End of an Incident
Verification after recovery is not a formality. It is the last line of defense before you declare the incident resolved. Without it, you risk calling the system healthy when it is still broken in ways that will create a bigger incident later.
The moment you confirm the system is actually working, that is when the incident truly ends. Everything before that is just recovery.