Kubernetes troubleshooting guide

Kubernetes OOMKilled:
Root Cause, Fix, and Prevention

OOMKilled (exit code 137) means your container hit its memory limit and the kernel killed it. Here is how to confirm the diagnosis, find the root cause, and stop it from recurring.

What OOMKilled actually means

Kubernetes enforces memory limits via Linux cgroups. When a container's memory usage exceeds resources.limits.memory, the kernel's OOM (Out Of Memory) killer fires. It sends SIGKILL (signal 9) to the container process — there is no graceful shutdown, no SIGTERM, no cleanup.

The resulting exit code is 137 (128 + signal 9). If the container's restart policy allows restarts and it keeps getting killed, the pod ends up in CrashLoopBackOff.

How to confirm OOMKilled
kubectl describe pod <pod-name> -n <namespace>

# Look for:
#   Last State:  Terminated
#     Reason:    OOMKilled
#     Exit Code: 137
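
If you want to check this from a script, the same fields are available through kubectl's JSONPath output. A minimal example (pod and namespace names are placeholders):

# Print name, last termination reason, and exit code for each container
kubectl get pod <pod-name> -n <namespace> \
  -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.lastState.terminated.reason}{"\t"}{.lastState.terminated.exitCode}{"\n"}{end}'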

Why containers get OOMKilled

01

Memory limit set too low

The most common cause. The limit was set at deployment time based on a guess, and actual peak usage exceeds it — especially under load.

Fix

Measure real usage first. Deploy to staging with no limits, run realistic load, check peak usage with kubectl top pod, then set limits 20-30% above peak.
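
As a concrete sketch (the numbers are illustrative, not a recommendation for your workload), a measured peak of roughly 400Mi could translate into something like:

# deployment.yaml (excerpt) — example values only
resources:
  requests:
    memory: "256Mi"   # typical steady-state usage
  limits:
    memory: "512Mi"   # ~25-30% above the measured 400Mi peak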

02

Traffic spike beyond designed capacity

The app runs fine at normal load but gets OOMKilled during traffic spikes. Memory usage scales with concurrent requests and your limits don't account for peaks.

Fix

Set limits based on maximum expected load, not average. Consider horizontal pod autoscaling so new pods absorb traffic before existing ones hit limits.
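
A sketch of a memory-based HorizontalPodAutoscaler using the autoscaling/v2 API (the deployment name, replica counts, and target are assumptions for illustration; memory utilization is measured against the pod's requests):

# hpa.yaml — example values, tune to your own measurements
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: payment-service          # hypothetical deployment name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: payment-service
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 70   # 70% of the memory request; scale out before pressure builds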

03

Memory leak in application code

Memory usage grows steadily over time and never stabilizes. The garbage collector cannot reclaim objects because references are held. Raising the limit only delays the next kill.

Fix

Profile the application to find the leak. In JVM apps, take a heap dump. In Go apps, use pprof. In Node.js, use --inspect and Chrome DevTools. Fix the leak in code.
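
For example, a JVM heap dump can usually be captured in place, assuming jmap ships in the container image and the application runs as PID 1 (both assumptions; adjust the PID and paths for your image):

# Capture a heap dump inside the running container, then copy it locally
kubectl exec <pod-name> -n <namespace> -- jmap -dump:live,format=b,file=/tmp/heap.hprof 1
kubectl cp <namespace>/<pod-name>:/tmp/heap.hprof ./heap.hprof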

04

JVM heap + native memory exceeding limit

Java and Scala apps have both heap (-Xmx) and native memory. If -Xmx is close to the container limit, native memory pushes total usage over the limit even if the heap looks fine.

Fix

Set -Xmx to 75% of the memory limit to leave headroom for native memory, JVM metaspace, and thread stacks.
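
On modern JVMs (8u191+ / JDK 10+) you can express this as a percentage of the container's memory limit instead of a fixed -Xmx, so the heap tracks the limit automatically. A sketch, assuming a 1Gi container limit:

# Container limit 1Gi -> heap capped around 768Mi, leaving roughly 256Mi
# for metaspace, thread stacks, and other native memory
java -XX:MaxRAMPercentage=75.0 -jar app.jar

# Equivalent fixed-size form
java -Xmx768m -jar app.jar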

05

Sidecar or init container consuming memory

In multi-container pods, every container counts toward the pod's total memory footprint; the pod's effective limit is the sum of its containers' limits. A logging sidecar or monitoring agent that uses more memory than expected contributes to that total.

Fix

Check kubectl top pod --containers to see per-container usage. Set separate limits per container that account for all sidecars.
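
A minimal sketch of per-container limits in a pod spec (container names and values are illustrative):

# pod spec excerpt — example values only
containers:
- name: app
  resources:
    requests: { memory: "256Mi" }
    limits:   { memory: "512Mi" }
- name: log-shipper            # hypothetical logging sidecar
  resources:
    requests: { memory: "64Mi" }
    limits:   { memory: "128Mi" }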

How to measure real memory usage

Current usage (live pods)
# Per pod
kubectl top pod <name> -n <namespace>

# Per container within pod
kubectl top pod <name> -n <namespace> --containers
Recommended limit formula
# 1. Run without limits under max load
# 2. Record peak usage
# 3. Set limits:
memory request = avg_steady_state * 1.0
memory limit   = peak_usage * 1.25

# Example: peak = 400Mi
# request: 256Mi, limit: 512Mi
PromQL — alert before OOMKill
# Alert when a container uses >80% of its memory limit
container_memory_working_set_bytes{container!=""}
  / container_spec_memory_limit_bytes{container!=""}
  > 0.80
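
Wrapped in a Prometheus alerting rule, that expression might look like the following sketch (the rule name, duration, and labels are assumptions, not part of the original query):

# prometheus-rules.yaml — example only
groups:
- name: memory.rules
  rules:
  - alert: ContainerNearMemoryLimit
    expr: |
      container_memory_working_set_bytes{container!=""}
        / container_spec_memory_limit_bytes{container!=""} > 0.80
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "{{ $labels.container }} is using more than 80% of its memory limit"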

Diagnosing OOMKilled with ActivLayer

ActivLayer correlates the OOMKill event with your metrics history, identifies whether it was a one-time spike or a steady leak, and proposes a specific fix — including the exact memory limit value to set.

$ activlayer analyze
> payment-service is OOMKilled again, third time this week
Checking memory usage history (7d)...
Correlating with traffic and deploy events...

Pattern: Memory grows linearly — 8MB/hour increase.
This is a memory leak, not a limit misconfiguration.

Correlation: Leak started after deploy at 14:23 on Apr 24.
Commit: 3e4f891 — added in-memory session cache without TTL.

Recommended actions (in order):
1. Increase limit to 1Gi as temporary relief
2. Revert to pre-Apr-24 version or add TTL to session cache

[apply temporary limit increase] [see revert options] [dismiss]

Frequently asked questions

What does OOMKilled mean in Kubernetes?
OOMKilled (Out Of Memory Killed) means the container exceeded its memory limit and the Linux kernel's OOM killer terminated it. Kubernetes sets the limit from resources.limits.memory in your container spec. When the process exceeds this limit, the kernel sends SIGKILL (signal 9), producing exit code 137 (128 + 9).
Why does OOMKilled keep happening even after I increase the limit?
If OOMKilled recurs after raising the limit, the container most likely has a memory leak — it keeps consuming memory rather than stabilizing. Increasing the limit only delays the next kill. You need to profile the application to find which objects are not being garbage collected and fix the leak in code.
What is the difference between OOMKilled and CrashLoopBackOff?
They are related but different. OOMKilled is the cause — the container was killed by the kernel for exceeding memory limits. CrashLoopBackOff is the symptom — Kubernetes keeps restarting the container after it crashes. A container can enter CrashLoopBackOff for many reasons; OOMKilled is one of the most common (exit code 137 in Last State confirms it).
How do I set the right memory limits in Kubernetes?
1. Deploy without limits and run under realistic load.
2. Measure peak memory with 'kubectl top pod' or your metrics backend.
3. Set requests equal to typical steady-state usage.
4. Set limits to 120-150% of measured peak.
5. Alert when usage exceeds 80% of the limit, so you have time to react before the next OOMKill.
Stop guessing memory limits

Get root cause and a specific fix in seconds

Free Community tier — connect your cluster in 5 minutes.

Try free

CrashLoopBackOff guide →