When Production Breaks: Why You Need Image Traceability and Rollback

A new version of your application just went live. Five minutes later, users start reporting errors. The first question that comes up in the team chat: "What version is running right now?"

If nobody can answer that question quickly, you lose precious time. Minutes turn into hours while people dig through deployment logs, check registry tags, and ask around to figure out what actually got deployed. By the time you know which image is in production, the damage has already grown.

This situation is more common than most teams admit. And it happens because two things were treated as optional: knowing exactly what is running, and having a reliable way to go back to something that worked.

Traceability Starts at Build Time

The ability to trace what is running in production begins the moment you build your container image. How you tag that image determines whether you can later identify it with certainty.

Tags like v1.2.3 or production are useful for humans. They help you recognize versions at a glance. But tags are not reliable for traceability. A tag is just a label that points to an image, and that label can change. The image myapp:production might point to version 1.2.3 today and version 1.3.0 tomorrow. If you only track tags, you never know for sure which version is actually running.

The reliable source of truth is the image digest. A digest is a unique hash computed from the content of the image. If two images have the same digest, they are identical: no ambiguity, no risk of mistagging, no overwritten labels. When you need to know exactly what is running, the digest is the answer.
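
You can see the difference for yourself. A minimal sketch, reusing the illustrative myapp:production image name from above: pull the tag, then ask the local engine which digest it resolved to.

# Pull the tag, then resolve it to the digest it currently points to
docker pull myregistry.com/myapp:production
docker inspect --format='{{index .RepoDigests 0}}' myregistry.com/myapp:production
# Output looks like: myregistry.com/myapp@sha256:a1b2c3...
# Run the same commands after the next release and the digest may differ,
# even though the tag reads exactly the same.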

Record the Digest, Not Just the Tag

In your pipeline, you should capture the digest of every image that passes through each stage. When an image is built, record its digest. When it passes security scanning, record it again. When it gets promoted to staging and then to production, keep that digest in your deployment records.

Where do you store this information? The most practical place is your deployment manifest. A deployment manifest is the file that tells your system how to run the container. In Kubernetes, that is a YAML file. In Docker Compose, it is a compose file. Every time you deploy, the manifest should reference the exact digest, not just the tag.

Here is what that looks like in a Kubernetes deployment:

spec:
  template:
    spec:
      containers:
      - name: myapp
        image: myregistry.com/myapp@sha256:a1b2c3d4e5f6...

Notice the @sha256:... part. That is the digest. When you use this format, you are telling Kubernetes to run that exact image, not whatever latest happens to point to.

To capture the digest in your pipeline, use a sequence like this:

# Build and push the image
docker build -t myregistry.com/myapp:latest .
docker push myregistry.com/myapp:latest

# Capture the digest from the registry
# (RepoDigests is populated only after the image has been pushed or pulled)
export DIGEST=$(docker inspect --format='{{index .RepoDigests 0}}' myregistry.com/myapp:latest)
echo "Deploying image: $DIGEST"

# Use the digest in your deployment manifest
sed "s|image: myregistry.com/myapp:latest|image: $DIGEST|" deployment.yaml > deployment-digest.yaml
kubectl apply -f deployment-digest.yaml

This ensures your deployment always references the exact image content, not a mutable tag.
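
If you would rather not rewrite the manifest with sed, kubectl can update the image reference directly. A sketch, assuming the Deployment and its container are both named myapp:

# Point the running Deployment at the exact digest captured above
kubectl set image deployment/myapp myapp="$DIGEST"

The trade-off: kubectl set image changes live cluster state but leaves no file behind, while the sed approach produces a manifest you can commit, which matters for the Git-based record discussed below.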

The following sequence diagram illustrates where digest recording and rollback fit into the deployment lifecycle:

sequenceDiagram
    participant Dev as Developer
    participant CI as CI Pipeline
    participant Reg as Registry
    participant K8s as Kubernetes
    participant User as Users
    Dev->>CI: Push code
    CI->>Reg: Build & push image<br/>(record digest)
    CI->>K8s: Deploy with digest<br/>@sha256:...
    K8s->>User: Serve new version
    User->>K8s: Report errors
    K8s->>CI: Alert: production broken
    CI->>K8s: Rollback: kubectl rollout undo<br/>(uses previous digest)
    K8s->>User: Serve previous version

By recording the digest in your manifest, you create a permanent record. You can look back at any point in time and know exactly which image was running. You can see when it was deployed, who triggered the deployment, and what changes came with it.
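
Kubernetes has a small, often overlooked helper for the "who and why" part of that record: the kubernetes.io/change-cause annotation, which shows up in the rollout history. A sketch, assuming the myapp Deployment from above ("CI build 1234" is a hypothetical label; substitute your own build identifier):

# Record why this deployment happened; it appears in the rollout history
kubectl annotate deployment/myapp --overwrite \
  kubernetes.io/change-cause="deploy $DIGEST, triggered by CI build 1234"
kubectl rollout history deployment/myapp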

Without this record, you are guessing. And guessing during an incident is expensive.

Rollback: The Safety Net You Build Before You Need It

Traceability gives you the answer to "what is running?" Rollback gives you the answer to "how do we go back to something that worked?"

Rollback is the process of returning your application to a previous image version that was known to be stable. But you cannot improvise it in the middle of an incident; you have to prepare for it before you deploy.

A good rollback strategy starts with three questions:

  1. Is the previous image still available in the registry? (A quick check is sketched after this list.)
  2. Is the previous deployment manifest still usable?
  3. Is the previous image compatible with the current configuration?
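
For the first question, you do not have to take the registry's retention policy on faith; you can ask. A sketch, reusing the hypothetical image name and placeholder digest from earlier (docker manifest inspect may require a recent Docker CLI and registry credentials):

# Succeeds only if the manifest for this digest still exists in the registry
docker manifest inspect myregistry.com/myapp@sha256:a1b2c3d4e5f6... > /dev/null \
  && echo "previous image available" \
  || echo "previous image missing: rollback would fail"

Run a check like this as a pre-deployment gate, and a missing rollback target becomes a pipeline failure instead of a mid-incident surprise.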

Many teams store their deployment manifests in Git. Every time they deploy, they commit the manifest with the exact digest. If something goes wrong, they can revert the manifest to a previous commit and redeploy. This is simple, auditable, and works across different environments.
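
A minimal sketch of that flow, assuming the deployment-digest.yaml file from the pipeline example earlier is committed on every deploy (the commit hash here is hypothetical):

# Find the deploy that introduced the bad digest, then revert it
git log --oneline -- deployment-digest.yaml
git revert abc1234        # hypothetical commit that shipped the broken image
kubectl apply -f deployment-digest.yaml

Because the revert is itself a commit, the rollback shows up in history like any other change: auditable, attributable, repeatable.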

In Kubernetes, you can use kubectl rollout undo to revert to a previous revision. This works because Kubernetes keeps a history of Deployment revisions, stored as old ReplicaSets. You control how many revisions are kept through the revisionHistoryLimit field in the Deployment spec (the default is 10). Too few, and you lose the ability to roll back far enough. Too many, and you clutter the cluster with history you may never use.
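
The commands themselves are short; a sketch against the myapp Deployment used throughout:

# List stored revisions, then roll back
kubectl rollout history deployment/myapp
kubectl rollout undo deployment/myapp                  # back to the previous revision
kubectl rollout undo deployment/myapp --to-revision=3  # or to a specific revision
                                                       # (number from the history output)

Note that rollout undo restores the previous pod template, including its image reference, which is exactly why recording digests in the manifest pays off here.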

When Rollback Works and When It Does Not

Rollback is fast and effective for application-level problems. If a new version introduced a bug in business logic, or a library update broke something, rolling back to the previous image restores service quickly.

But rollback is not a universal fix. If the problem is in the database schema, rolling back the application image may not help. The database might already be in a state that the old application code cannot handle. If the problem is in configuration that was changed separately from the image, rolling back the image alone leaves the bad configuration in place.

Know the boundaries of your rollback mechanism. Test it regularly. Make sure your team knows when to use it and when to look for another solution.

After the Rollback, Fix the Root Cause

Rollback restores service. It does not fix the problem. Once you have rolled back and users are no longer affected, the real work begins.

The image that caused the problem needs to be fixed, and the corrected version must travel through the same pipeline as any other change: build, scan, promote, deploy. The rollback was a safety net, not the end of the journey.

Some teams make the mistake of treating rollback as the final step. They roll back, declare the incident resolved, and move on. The same bug surfaces again in the next release because nobody investigated the root cause. Do not let that happen.

Practical Checklist

Before your next production deployment, run through this checklist:

  • Every image in the pipeline is referenced by digest, not just by tag
  • Deployment manifests are stored in version control with the exact digest
  • Previous images are retained in the registry for at least the last N versions
  • Rollback procedure is documented and tested in a non-production environment
  • The team knows the difference between problems that rollback can fix and problems that need a different approach

What This Means for Your Team

Traceability and rollback are not advanced topics. They are basic operational hygiene. You do not need a complex platform or expensive tools to implement them. You need discipline in how you tag images, how you record deployments, and how you prepare for the moment something goes wrong.

The next time production breaks, the first question will still be: "What version is running?" With image traceability in place, you will have the answer in seconds. And with a tested rollback mechanism, you will be able to restore service in minutes instead of hours.

Build the safety net before you need it. Your future self, debugging at 2 AM, will thank you.