Kubernetes troubleshooting guide

Kubernetes CrashLoopBackOff: What it means and how to fix it fast

Most engineers spend 20–30 minutes on a CrashLoopBackOff running the same 4 kubectl commands and Googling error messages. Here is a systematic approach to cut that to under 2 minutes.

What CrashLoopBackOff actually means

A container in CrashLoopBackOff is stuck in a loop: it starts, crashes, Kubernetes tries to restart it, it crashes again. Kubernetes applies exponential backoff between restarts to avoid hammering a broken container: 10s, 20s, 40s, 80s, 160s, then 300s (5 minutes) indefinitely.

The status CrashLoopBackOff means Kubernetes has detected the loop and is applying backoff. It does not mean the fix is to delete and recreate the pod — the underlying issue will persist.

Backoff timing

Restart 1: 10s
Restart 2: 20s
Restart 3: 40s
Restart 4: 80s
Restart 5: 160s
Restart 6: 300s (cap)
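
You can watch the loop and the growing backoff live with a pod watch; the RESTARTS column shows the count and how long ago the last restart happened. The sample output below is illustrative:

kubectl get pod <pod-name> -n <namespace> -w
# NAME              READY   STATUS             RESTARTS      AGE
# payment-service   0/1     CrashLoopBackOff   5 (38s ago)   6m12s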

The 4-command loop every SRE knows

The standard debugging sequence for CrashLoopBackOff — in the order you should run them:

1. kubectl get pods -n <namespace>
   Confirm the status and see the restart count.

2. kubectl describe pod <pod-name> -n <namespace>
   Read the Events section and the Last State exit code.

3. kubectl logs <pod-name> -n <namespace> --previous
   Read the logs from the previous (crashed) container.

4. kubectl logs <pod-name> -n <namespace> -c <container>
   In a multi-container pod, target the specific crashing container.

After these four commands, you cross-reference the output with Grafana, check Slack history for recent deploys, and Google the specific error. Average time: 20–40 minutes.
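
If you run this loop many times a week, a small shell function keeps it to one command. A minimal sketch; the crashloop-triage name and the default namespace are illustrative, not part of any standard tooling:

# Hypothetical helper: runs the standard triage sequence for one pod.
# Usage: crashloop-triage <pod-name> [namespace]
crashloop-triage() {
  local pod="$1" ns="${2:-default}"
  kubectl get pod "$pod" -n "$ns"                                # status, restart count
  kubectl describe pod "$pod" -n "$ns" | grep -A5 "Last State"   # exit code
  kubectl logs "$pod" -n "$ns" --previous | tail -50             # crash logs
}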

Decode the exit code first

The exit code in kubectl describe pod narrows the diagnosis before you read a single log line. Find it under Last State > Exit Code.

Exit code 0: Success
Container exited cleanly, which is not a crash. Check whether it was supposed to keep running.

Exit code 1: Application error
The app crashed. Check logs for the error message: missing config, connection refused, unhandled exception.

Exit code 137: OOMKilled
Memory limit exceeded; the kernel killed the process with SIGKILL (137 = 128 + 9). Increase the memory limit or reduce usage.

Exit code 139: Segmentation fault
Memory access violation (139 = 128 + 11). Usually an application bug or an incompatible native library.

Exit code 143: SIGTERM
The container exited after receiving SIGTERM (143 = 128 + 15), typically during a rolling update or node drain. If shutdown regularly takes longer than the grace period, tune terminationGracePeriodSeconds.

Exit code 255: Unknown
Generic failure with no specific meaning. Treat it like exit code 1 and check the logs.
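
To pull just the exit code without scanning the full describe output, a jsonpath query does it in one line. This assumes the crashed container is the first in the pod; adjust the index for multi-container pods:

kubectl get pod <pod-name> -n <namespace> \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'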

The 7 most common CrashLoopBackOff causes

01. Missing environment variable

The most common cause. App starts, tries to read DB_URL or API_KEY, finds nil, and crashes. Fix: verify all required env vars are set in the deployment spec or referenced secret.

kubectl describe pod <name> | grep -A5 Environment
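
If the variable should come from a Secret, a minimal sketch of the wiring (the image tag and Secret name are illustrative):

# Hypothetical deployment fragment: DB_URL sourced from a Secret.
containers:
  - name: app
    image: myapp:1.4.2
    env:
      - name: DB_URL
        valueFrom:
          secretKeyRef:
            name: app-secrets   # illustrative Secret name
            key: db-url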
02. OOMKilled: memory limit too low

Exit code 137. The app's peak memory usage exceeds its limit. Fix: run without limits briefly, measure peak usage with kubectl top pod, set limits 20% above peak.

kubectl describe pod <name> | grep -A3 "Last State"
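
The corresponding resources block, assuming a measured peak around 400Mi (all values illustrative):

# Hypothetical resources block: request steady-state usage,
# set the limit comfortably above the measured peak.
containers:
  - name: app
    resources:
      requests:
        memory: "256Mi"
      limits:
        memory: "512Mi"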
03. Liveness probe too aggressive

The liveness probe fires before the app is ready, fails, kills the container, and the cycle repeats. Fix: add initialDelaySeconds to give the app time to start. Typical Java apps need 45–90s.

livenessProbe:
  initialDelaySeconds: 30
  periodSeconds: 10
04. Dependency not ready at startup

App starts before its database or upstream service is available. Fix: add an init container that waits for the dependency, or implement retry logic in the app itself.

initContainers:
  - name: wait-for-db
    image: busybox
    command: ['sh', '-c', 'until nc -z db 5432; do sleep 2; done']
05. Bad image tag

Image pulled successfully but is the wrong version — incompatible binary or missing files. Fix: check the image tag in the deployment spec, verify the registry has the correct image, use digest pinning for critical deployments.

kubectl describe pod <name> | grep Image
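
Digest pinning replaces the mutable tag with the image's content digest, so a deploy can't silently pick up a different build. The registry path is illustrative and <digest> is a placeholder:

# Pin by digest instead of tag; <digest> stands in for the real sha256 value.
image: registry.example.com/team/app@sha256:<digest>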
06. Volume mount or permission issue

App tries to write to a mounted volume but doesn't have permission. Fix: check the securityContext, ensure the volume is mounted at the right path, and that the owning UID matches.

kubectl exec -it <pod> -- ls -la /data
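
A minimal securityContext sketch that makes supported volumes group-writable for the app's user (the UID/GID values are illustrative):

# Hypothetical pod-level securityContext: fsGroup sets group ownership on
# supported volumes at mount time; the container process runs as UID 1000.
securityContext:
  runAsUser: 1000
  fsGroup: 2000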
07. Application bug (exit code 1)

The app has a runtime exception — null pointer, type error, unhandled panic. Fix: read the crash logs carefully. The last few lines before exit almost always contain the stack trace.

kubectl logs <name> --previous | tail -50

Diagnosing CrashLoopBackOff with ActivLayer

Instead of running four kubectl commands and cross-referencing multiple sources, you describe the problem in plain English:

$ activlayer analyze
> why is payment-service crashing in namespace production
Analyzing payment-service (exit code 137, 8 restarts)
Reading previous container logs...
Checking resource limits and usage history...

Root cause: OOMKilled. The container hit its 256Mi memory limit
during peak order processing (Black Friday traffic spike).
Peak usage: 312Mi. Limit: 256Mi.

Proposed fix: Increase memory limit to 512Mi
kubectl patch deployment payment-service -p ...

[dry-run preview] [approve and apply] [dismiss]

ActivLayer reads the pod logs, event history, and resource metrics automatically, identifies the root cause, and generates a specific remediation command. You approve the dry-run preview and the fix applies — all within your terminal.

Frequently asked questions

How do I fix CrashLoopBackOff in Kubernetes?
First check the exit code: run 'kubectl describe pod <name>' and look at Last State. Exit code 1 means an application error — check logs with 'kubectl logs <name> --previous'. Exit code 137 means OOMKilled — increase memory limits. Exit code 139 means segfault — likely a bug or incompatible library. Fix the underlying cause, then redeploy.

What causes exit code 137 in Kubernetes?
Exit code 137 means the container was killed by the kernel's OOM (Out Of Memory) killer. Kubernetes sets a memory limit on the container, and when the process exceeds it, the kernel kills it with signal 9 (SIGKILL). The exit code 137 = 128 + 9. Fix: increase the container's memory limit or reduce the application's memory usage.

How long does CrashLoopBackOff take to resolve on its own?
CrashLoopBackOff uses exponential backoff: 10s, 20s, 40s, 80s, 160s, capped at 300s (5 minutes). It does not resolve on its own — Kubernetes keeps retrying indefinitely until you fix the underlying issue or delete the pod. The backoff resets if the container runs for more than 10 minutes.

Can AI diagnose CrashLoopBackOff automatically?
Yes. ActivLayer can read your pod's logs, describe output, and event history, then identify the root cause in plain English. It explains what failed and why, generates a specific kubectl patch or config change, runs a dry-run preview, and only applies it after your approval. Average diagnosis time: under 15 seconds.
Stop debugging manually

Diagnose your next CrashLoopBackOff in under 15 seconds

Connect your cluster in 5 minutes. Free tier, no credit card, no demo call.

Try free — connect your cluster
OOMKilled guide →