What Happens After Rollback: Verifying Your Recovery Actually Worked
You just hit the rollback button. The deployment that caused errors, slow responses, or database corruption is gone. Your application is back on the previous version. Your team breathes a sigh of relief.
But is the system actually working?
Rolling back or rolling forward is not the end of an incident. It is the middle. The real question is whether the recovery itself introduced new problems, left behind residual data, or broke integrations with other services. Without proper verification after recovery, you might declare the incident resolved while the system is still broken in ways that only surface hours later.
The Hidden Risks of Recovery
When you roll back an application, you are not simply undoing time. The old version returns, but the environment may have changed. Database schemas might not match. Configuration files could be from the newer version. Data written by the problematic version remains in the database. Services that depended on the new API format now receive old responses and break.
Roll-forward with a hotfix has its own risks. You fixed one bug, but the hotfix might have changed something else. A configuration value that worked in the emergency fix might not be correct for normal operation. The hotfix might have been written quickly without the usual testing, and it could contain its own defects.
This is why verification after recovery is not optional. It is the safety net that catches the problems recovery itself creates.
Start With Smoke Tests
The fastest way to check if the system is alive is a smoke test. This is a short set of checks that confirm the core functions of your application work. Not every feature, not every edge case, just the critical paths that users depend on.
A good smoke test for a web application might include:
- Can users log in?
- Can they see their main dashboard?
- Can they submit a form or complete a transaction?
- Are there any errors in the application logs?
These checks should be automated. If your smoke test requires someone to manually click through screens, it will be skipped when the team is tired and under pressure after an incident. Automate it so that within minutes of completing the recovery, the smoke test runs and gives a clear pass or fail.
For example, a simple automated smoke test using curl might look like this:
#!/bin/bash
# Simple smoke test after rollback
BASE_URL="https://my-app.example.com"

# Check health endpoint
if ! curl -f -s -o /dev/null "$BASE_URL/health"; then
  echo "FAIL: Health endpoint unreachable"
  exit 1
fi

# Check login page loads
if ! curl -f -s -o /dev/null "$BASE_URL/login"; then
  echo "FAIL: Login page not loading"
  exit 1
fi

# Check a critical API endpoint responds successfully
if ! curl -f -s -o /dev/null "$BASE_URL/api/v1/status"; then
  echo "FAIL: API status endpoint failed"
  exit 1
fi

echo "PASS: All smoke tests passed"
Compare Metrics Before and After
Smoke tests tell you the application is running. Metrics tell you if it is running well.
Compare key metrics from before the incident, during the incident, and after recovery. The metrics that matter depend on your application, but common ones include:
- Requests per second
- Error rate (percentage of failed requests)
- Average response time
- Memory and CPU usage
- Database query latency
If the error rate after recovery is still higher than before the incident, something is wrong. Maybe a service did not restart properly. Maybe a configuration did not roll back. Maybe the recovery process itself introduced a new bottleneck.
Do not rely on a single metric. A low error rate combined with high latency can still mean the system is unhealthy. Look at the whole picture.
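If your metrics live in Prometheus, this comparison can be scripted rather than eyeballed. The sketch below is assumption-heavy: the server URL, the http_requests_total metric, and the baseline timestamp are placeholders to adapt to your own stack:

#!/bin/bash
# Compare the current error rate against a pre-incident baseline.
# PROM_URL, the metric name, and BASELINE_TS are all placeholders.
PROM_URL="http://prometheus.example.com:9090"
QUERY='sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))'

# Error rate right now
NOW=$(curl -s -G "$PROM_URL/api/v1/query" --data-urlencode "query=$QUERY" \
  | jq -r '.data.result[0].value[1] // "0"')

# Error rate at a point before the incident started
BASELINE_TS="2024-05-01T09:00:00Z"
BEFORE=$(curl -s -G "$PROM_URL/api/v1/query" \
  --data-urlencode "query=$QUERY" --data-urlencode "time=$BASELINE_TS" \
  | jq -r '.data.result[0].value[1] // "0"')

echo "error rate before incident: $BEFORE, after recovery: $NOW"
# Flag the recovery as suspect if the error rate is still well above baseline
awk -v a="$NOW" -v b="$BEFORE" 'BEGIN { exit (a > b * 1.5) ? 1 : 0 }' \
  || echo "WARN: error rate still above pre-incident baseline"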
Database Verification Requires Extra Care
Database recovery is the trickiest part. If you rolled back the database to a backup taken before the deployment, any data written between that backup and the rollback is gone. That data might include user transactions, orders, or configuration changes.
You need to check:
- Is the data that was lost acceptable from a business perspective?
- Can you recover critical data from logs or other systems?
- Are there any orphaned records left by the newer version that the old version cannot handle?
Sometimes the answer is that some data loss is acceptable. Other times you need to manually restore specific records. Either way, you need to know what was lost and decide whether it needs action.
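What these checks look like depends entirely on your schema. As a sketch, assuming a PostgreSQL database with hypothetical orders and order_items tables, you might bound the window of lost writes and look for child rows whose parent disappeared in the restore:

#!/bin/bash
# Hypothetical integrity checks after a database rollback.
# DB_URL and all table and column names are placeholders for your own schema.
DB_URL="postgres://user:pass@db.example.com/app"

# Orphaned child rows left behind by the newer version
psql "$DB_URL" -t -c "
  SELECT count(*)
  FROM order_items oi
  LEFT JOIN orders o ON o.id = oi.order_id
  WHERE o.id IS NULL;"

# Timestamp of the newest surviving row, to bound the window of lost writes
psql "$DB_URL" -t -c "SELECT max(created_at) FROM orders;"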
Check Integrations With Other Systems
Your application does not live in isolation. After recovery, the old version might not be compatible with APIs that other teams changed while your deployment was in progress. Or the hotfix might have altered the format of data sent to monitoring, logging, or analytics systems.
Test the connections:
- Can your application still call external APIs?
- Do those APIs return responses your version can parse?
- Are downstream services receiving the data they expect?
Integration issues after recovery are common because teams focus on their own application and forget about the services it depends on and the services that depend on it.
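These connection tests can be scripted the same way as the smoke test. A minimal sketch, assuming a hypothetical partner endpoint and a JSON status field your version expects, verifies not just that the call succeeds but that the response still parses:

#!/bin/bash
# Verify an outbound integration still works after recovery.
# The partner URL and the expected "status" field are assumptions.
PARTNER_URL="https://api.partner.example.com/v2/ping"

RESPONSE=$(curl -f -s "$PARTNER_URL") || {
  echo "FAIL: partner API unreachable"
  exit 1
}

# Confirm the response is JSON in the shape the rolled-back version can parse
echo "$RESPONSE" | jq -e '.status' > /dev/null || {
  echo "FAIL: partner API response missing expected field"
  exit 1
}
echo "PASS: partner integration healthy"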
Verify From the User's Perspective
Dashboards and logs show what the system is doing. They do not always show what the user is experiencing. A user might see a blank page that does not generate an error log. A transaction might fail silently because the frontend does not display the error message.
If possible, have internal users test the main features after recovery. Or use real user monitoring tools to check metrics like transaction success rate and session duration. These metrics often reveal problems that infrastructure monitoring misses.
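A cheap approximation is a synthetic user journey that checks page content rather than status codes. The sketch below assumes a form-based login, a dedicated test account, and a marker string on the dashboard; all three are placeholders:

#!/bin/bash
# Synthetic user journey: log in, load the dashboard, check the content.
# Credentials, paths, and the expected marker text are all placeholders.
BASE_URL="https://my-app.example.com"
COOKIES=$(mktemp)
trap 'rm -f "$COOKIES"' EXIT

# Log in as a dedicated test user and keep the session cookie
curl -f -s -c "$COOKIES" -d "user=smoketest&pass=secret" \
  "$BASE_URL/login" -o /dev/null || { echo "FAIL: login"; exit 1; }

# A 200 is not enough: confirm the dashboard actually rendered content
curl -f -s -b "$COOKIES" "$BASE_URL/dashboard" \
  | grep -q "Recent activity" \
  || { echo "FAIL: dashboard loaded but content missing"; exit 1; }

echo "PASS: synthetic user journey succeeded"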
Document What You Found
After verification is complete, write down what happened. This documentation serves two purposes:
- It provides evidence that the recovery was successful and the system is back to normal.
- It helps your team improve the deployment and recovery process for next time.
Include what went wrong, what the recovery involved, what you checked during verification, and any issues you discovered. This record will help other team members who face similar problems in the future.
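The format matters less than the habit. A minimal plain-text template might look like this:

Incident: <id / date>
Trigger: what the deployment broke
Recovery: rollback or roll-forward, and which version is now live
Verification performed: smoke tests, metric comparison, DB checks, integrations
Data loss: what was lost, and the decision taken on it
Open issues: anything still being watched
Follow-ups: process changes for the next deployment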
A Practical Verification Checklist
Here is a short checklist to use after any recovery:
- Smoke test passed for all critical functions
- Error rate returned to pre-incident levels
- Response time is normal
- Resource usage (CPU, memory, disk) is stable
- Database integrity checked, data loss documented
- All external integrations are working
- User-facing metrics show normal behavior
- Findings documented for future reference
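If the checks above live as separate scripts, a thin wrapper can run the whole checklist in order and stop at the first failure; the script names here are assumptions standing in for your own:

#!/bin/bash
# Run the full post-recovery verification suite in order.
# Each script name is a placeholder for the checks sketched above.
set -e
./smoke_test.sh
./compare_metrics.sh
./check_database.sh
./check_integrations.sh
./synthetic_user_journey.sh
echo "All post-recovery checks passed, incident can be declared resolved"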
The Real End of an Incident
Verification after recovery is not a formality. It is the last line of defense before you declare the incident resolved. Without it, you risk calling the system healthy when it is still broken in ways that will create a bigger incident later.
The moment you confirm the system is actually working, that is when the incident truly ends. Everything before that is just recovery.