Deployment Is Not Done Until You Know It's Working

A team pushes a new version to production. The pipeline is green. The deployment log shows no errors. Everyone breathes a sigh of relief and moves on to the next task. Two hours later, a customer emails support saying the checkout page is broken. The team scrambles to investigate, rolls back, and spends the next day figuring out what went wrong.

This scenario is common. Many teams treat deployment as finished the moment the new version starts running on production servers. But in practice, deployment is only complete when you know whether that version is actually working well for users.

Production Is Not a Clean Room

When a new version goes live, it enters an environment that staging can never fully replicate. Real users bring real data, real traffic patterns, and real device configurations. Things happen that no test environment predicted.

A query that ran fine with a thousand rows in staging might slow to a crawl with a million rows in production. An API change that seemed backward-compatible might break a mobile client that hasn't been updated in six months. A new feature that looked great in design reviews might confuse users so badly that nobody clicks on it.

These are not failures. They are signals. The question is whether your team is set up to catch them.

Signals That Matter

Good teams don't wait for users to report problems. They set up automated systems that capture signals from production. The most useful signals fall into a few categories:

  • Error rate changes. If error rates spike right after a deployment, something is likely broken. A 5 percent increase across all endpoints probably needs immediate rollback. A 0.1 percent increase on a rarely used endpoint might be a bug to fix in the next release.

  • Response time degradation. Slower responses often point to database bottlenecks, inefficient queries, or resource contention. This signal is especially important because users may not complain immediately, but they will start abandoning the service.

  • Transaction volume drops. A sudden decrease in completed transactions can mean users are hitting errors, getting stuck in a flow, or simply giving up. This signal is harder to detect because it requires comparing current traffic against historical baselines; a sketch of that comparison appears below.

Each signal means something different. The key is knowing which signals need immediate action and which can wait.
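
To make the baseline comparison concrete, here is a minimal sketch of a volume check. It assumes a hypothetical metrics endpoint on the monitoring service that returns completed-transaction counts for a given window; the URL parameters and JSON fields are illustrative, not a real API.

#!/bin/bash
# Compare current completed-transaction volume against the same window last week.
# The endpoint and JSON fields are hypothetical placeholders for your metrics API.
MAX_DROP_PCT=20

CURRENT=$(curl -s "https://monitoring.example.com/api/v1/transactions?window=1h" | jq -r '.count')
BASELINE=$(curl -s "https://monitoring.example.com/api/v1/transactions?window=1h&offset=7d" | jq -r '.count')

# Without a baseline there is nothing to compare against; skip rather than alarm.
if [ -z "$BASELINE" ] || [ "$BASELINE" = "null" ] || [ "$BASELINE" -eq 0 ]; then
  echo "No baseline data available; skipping volume check."
  exit 0
fi

DROP=$(( (BASELINE - CURRENT) * 100 / BASELINE ))
if [ "$DROP" -gt "$MAX_DROP_PCT" ]; then
  echo "Completed transactions down ${DROP}% versus last week. Investigate."
  exit 1
fi
echo "Transaction volume within ${MAX_DROP_PCT}% of the weekly baseline."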

Here is a practical example of how a team might automate the decision to roll back based on error rate:

#!/bin/bash
# Post-deployment health check: query the error rate and roll back if it exceeds 5%.
THRESHOLD=5.0
DEPLOY_ID=$(curl -s "https://monitoring.example.com/api/v1/deploy/latest" | jq -r '.id')

# Use --data-urlencode so the braces and quotes in the query survive the URL.
ERROR_RATE=$(curl -sG "https://monitoring.example.com/api/v1/query" \
  --data-urlencode "query=error_rate{deploy_id=\"$DEPLOY_ID\"}" \
  | jq -r '.data.result[0].value[1]')

# Fail safe: if the metric is missing, do not silently confirm the deployment.
if [ -z "$ERROR_RATE" ] || [ "$ERROR_RATE" = "null" ]; then
  echo "Could not read error rate for deploy $DEPLOY_ID. Check monitoring."
  exit 1
fi

if (( $(echo "$ERROR_RATE > $THRESHOLD" | bc -l) )); then
  echo "Error rate $ERROR_RATE% exceeds threshold $THRESHOLD%. Rolling back..."
  kubectl rollout undo deployment/my-app
  exit 1
else
  echo "Error rate $ERROR_RATE% is within limits. Deployment confirmed."
fi

From Signal to Root Cause

Once a signal is detected, the next step is finding the root cause. Is it a code bug? A configuration mismatch? A data issue? An infrastructure problem? The answer determines who fixes it and how.

The following flowchart illustrates how a team can move from deployment to signal detection, root cause analysis, and action.

flowchart TD
  A[Deploy new version] --> B[Monitor signals]
  B --> C{Signal normal?}
  C -->|Yes| D[Deployment confirmed]
  C -->|No| E[Investigate root cause]
  E --> F{Code bug?}
  F -->|Yes| G[Fix code]
  F -->|No| H{Config issue?}
  H -->|Yes| I[Fix config]
  H -->|No| J{Data or infra?}
  J -->|Yes| K[Fix data/infra]
  J -->|No| L[Escalate]
  G --> M[Rollback or hotfix]
  I --> M
  K --> M
  L --> M
  M --> B

This is where many teams get stuck. They see an error spike and immediately assume the code is wrong. But sometimes the code is fine and the problem is a configuration value that differs between staging and production. Sometimes the schema migration itself was correct, but the application code went live before the migration had finished running.

A mature team doesn't just fix the immediate problem. They also fix the process that let the problem through. If a database migration caused trouble, add migration checks to the pipeline. If staging and production configurations drifted, make them identical. If a certain type of change keeps causing issues, update the deployment checklist to catch it earlier.
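
For example, a simple migration gate in the pipeline can refuse to roll out application code while migrations are still pending. This is a minimal sketch; "migrate-tool status" is a placeholder for whatever status command your migration framework provides.

#!/bin/bash
# Pipeline gate: block the application rollout while database migrations are pending.
# "migrate-tool status" is a placeholder for your migration framework's status command.
PENDING=$(migrate-tool status | grep -c 'pending')

if [ "$PENDING" -gt 0 ]; then
  echo "Found $PENDING pending migration(s). Apply them before deploying the app."
  exit 1
fi
echo "All migrations applied. Safe to roll out application code."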

Feedback Improves Decision-Making

Feedback from production is not just about fixing bugs. It also helps teams evaluate their own decisions. Remember the readiness criteria you set before deployment? Did they actually prevent problems? Or did a serious issue slip through because your criteria didn't cover that scenario?

With real data from production, teams can adjust their deployment criteria. They can see which checks are effective and which ones create false confidence. They can identify patterns: maybe all database-related incidents happen on Friday deployments, so they stop deploying database changes on Fridays. Maybe all configuration-related incidents happen when a specific team member is on leave, so they add a backup reviewer.

This is how deployment processes improve over time. Not by following best practices from a book, but by learning from your own production data.

Speed of Feedback Matters

The faster feedback reaches the team, the faster they can respond. That is why post-deployment validation is a critical practice. Instead of waiting for errors to accumulate, teams actively verify that the new version is behaving normally within the first few minutes or hours after release.

Post-deployment validation can take several forms:

  • Automated smoke tests that run against production endpoints right after deployment (a sketch follows this list).
  • Metric comparisons that show before-and-after snapshots of error rates, response times, and throughput.
  • Log analysis that looks for unusual patterns in the first few minutes of traffic.
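
As a concrete illustration, a post-deployment smoke test can be a handful of curl checks against critical endpoints. The base URL and paths below are placeholders for whatever your service's critical flows actually are.

#!/bin/bash
# Post-deployment smoke test: hit critical endpoints and fail on any bad status.
# The base URL and endpoint paths are illustrative; replace them with your own.
BASE_URL="https://www.example.com"
ENDPOINTS=("/healthz" "/api/v1/products" "/api/v1/checkout/status")

for path in "${ENDPOINTS[@]}"; do
  STATUS=$(curl -s -o /dev/null -w '%{http_code}' "$BASE_URL$path")
  if [ "$STATUS" -ge 400 ]; then
    echo "Smoke test failed: $path returned HTTP $STATUS"
    exit 1
  fi
  echo "OK: $path returned HTTP $STATUS"
done
echo "All smoke tests passed."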

Some teams go further and run canary deployments, where the new version serves only a small percentage of traffic. If signals look good, traffic is gradually increased. If signals turn bad, the canary is rolled back automatically. This approach limits blast radius while still giving real production feedback.
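
The promotion loop itself can be automated. The sketch below reuses the monitoring query style from the earlier example and assumes a hypothetical set-canary-weight command; in practice, a service mesh or ingress controller would provide the real traffic-splitting mechanism.

#!/bin/bash
# Canary promotion loop: shift traffic in steps and roll back if errors spike.
# "set-canary-weight" is a hypothetical command; your mesh or ingress provides the real one.
THRESHOLD=2.0

for WEIGHT in 5 25 50 100; do
  set-canary-weight "$WEIGHT"
  echo "Canary at ${WEIGHT}% of traffic. Watching for 10 minutes..."
  sleep 600

  ERROR_RATE=$(curl -sG "https://monitoring.example.com/api/v1/query" \
    --data-urlencode "query=canary_error_rate" | jq -r '.data.result[0].value[1]')
  if [ -z "$ERROR_RATE" ] || [ "$ERROR_RATE" = "null" ] || \
     (( $(echo "$ERROR_RATE > $THRESHOLD" | bc -l) )); then
    echo "Canary unhealthy (error rate: ${ERROR_RATE:-unknown}). Rolling back."
    set-canary-weight 0
    exit 1
  fi
done
echo "Canary promoted to 100% of traffic."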

Feedback Needs a System

Collecting feedback is useless if there is no system to manage it. Teams need a way to gather signals, filter noise, and prioritize actions. A dashboard full of graphs is not enough. The system must help the team make better decisions.

This means defining clear thresholds for each signal. "If error rate exceeds 2 percent for more than five minutes, page the on-call engineer." "If response time doubles for any critical endpoint, create a ticket for the next sprint." Without thresholds, every signal looks urgent, and the team burns out chasing false alarms.

It also means having a clear escalation path. Not every signal needs the same response. Some signals trigger automated rollback. Some trigger a ticket. Some trigger a meeting to discuss whether the deployment process needs changes. The system should make these distinctions clear.
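
One lightweight way to encode those distinctions is a dispatcher that maps each signal's severity to a response. The severity labels and the notification commands below are assumptions; wire them to whatever paging and ticketing tools your team actually uses.

#!/bin/bash
# Map a signal's severity to a response: rollback, page, or ticket.
# Severity labels and the notify/ticket commands are illustrative placeholders.
SIGNAL="$1"     # e.g. "error_rate"
SEVERITY="$2"   # e.g. "critical", "warning", "info"

case "$SEVERITY" in
  critical)
    echo "[$SIGNAL] critical: rolling back and paging on-call."
    kubectl rollout undo deployment/my-app
    page-oncall "Automated rollback triggered by $SIGNAL"      # hypothetical pager CLI
    ;;
  warning)
    echo "[$SIGNAL] warning: filing a ticket for the next sprint."
    create-ticket "Investigate elevated $SIGNAL after deploy"  # hypothetical ticket CLI
    ;;
  *)
    echo "[$SIGNAL] info: logged for the weekly deployment review."
    ;;
esac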

A Practical Checklist for Production Feedback

Here is a short checklist to evaluate whether your team is getting useful feedback from production:

  • Do you have automated alerts for error rate, response time, and transaction volume changes after every deployment?
  • Do you know the baseline values for these metrics before you deploy?
  • Do you have clear thresholds that distinguish between "investigate later" and "rollback now"?
  • Do you run post-deployment smoke tests against production?
  • Do you review deployment incidents to improve your pipeline and criteria?
  • Does feedback from production ever change how you build your pipeline?

If you answered no to more than two of these, your team is flying blind after deployment.

The Real End of a Deployment

A deployment is not finished when the new version is running. It is finished when the team knows the new version is running well. That knowledge comes from feedback systems that capture signals, filter noise, and drive action. Without those systems, every deployment is a gamble. With them, every deployment becomes a learning opportunity that makes the next one safer.