Java Spring Boot Probe Troubleshooting
Problem#
Java Spring Boot apps in the Staging EKS environment kept restarting in CrashLoopBackOff.
Symptoms#
- Pods receive SIGTERM (exit code 143) within 60 seconds of starting
- kubelet forcefully kills the container due to liveness probe failures
- 404 errors (initially) → 503/500 errors (after partial fix)
Root Cause Analysis#
1. Missing Context Path (Primary Cause)#
Spring Boot apps use a context path:
- auth-guard: /auth
- queue: /queue
- seat: /seat
- order-core: /order
But the probe configuration was:
```yaml
livenessProbe:
  httpGet:
    path: /actuator/health/liveness   # 404 error!
```
The actual paths should be:
```
/auth/actuator/health/liveness
/queue/actuator/health/liveness
...
```
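The prefix comes from the standard Spring Boot property for the servlet context path. A minimal application.yml sketch for auth-guard (the service name is from this document; the property itself is stock Spring Boot):

```yaml
# application.yml (auth-guard) -- sketch
server:
  servlet:
    context-path: /auth   # every endpoint, including /actuator/*, is served under /auth
```

With this set, the management endpoints move along with the rest of the app, which is exactly why the unprefixed probe path returned 404.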
2. Insufficient initialDelaySeconds (Secondary Cause)#
The issue persisted after fixing the context path (the 404s became 503/500s). The remaining cause was Java application startup time.
Original configuration:
```yaml
livenessProbe:
  initialDelaySeconds: 60   # too short
readinessProbe:
  initialDelaySeconds: 30
```
Java Spring Boot startup times:
- Simple app: 10–30 seconds
- Medium-sized app: 30–60 seconds
- Complex app (DB connections, caching, etc.): 60–120+ seconds
Observed startup time in Staging: 50–90 seconds
Dev vs Staging Environment Differences#
Dev Environment (Mini PC)#
- CPU: High-performance desktop CPU (Intel/AMD)
- Memory: Ample RAM
- Disk: NVMe SSD
- JIT compilation: Fast
Staging Environment (EKS t4g.large)#
- CPU: AWS Graviton2 (ARM64) 2 vCPU
- Memory: 8GB (shared across multiple Pods)
- Spot instances: possible resource contention
- JIT compilation: Can be slightly slower on ARM
Conclusion: t4g.large is not a bad spec, but Java cold start can be 10–20% slower on ARM vs x86. Additionally, multiple Pods starting simultaneously can slow things down further due to resource contention.
Fix#
1. Helm Chart Update (java-service)#
Added contextPath support in deployment.yaml:
```yaml
livenessProbe:
  httpGet:
    path: {{ .Values.contextPath }}{{ .Values.livenessProbe.httpGet.path }}
```
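For this template to render correctly for services with no prefix, contextPath can default to an empty string. A sketch of the matching chart defaults (the key layout is assumed from the template above, not taken from the actual chart):

```yaml
# common-charts/apps/java-service/values.yaml -- assumed key layout
contextPath: ""   # default: no prefix (e.g. api-gateway)

livenessProbe:
  httpGet:
    path: /actuator/health/liveness

# Per-service override, e.g. in values-auth-guard.yaml:
#   contextPath: /auth
# -> rendered probe path: /auth/actuator/health/liveness
```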
2. Optimized Probe Settings#
Recommended settings (Java Spring Boot):
```yaml
livenessProbe:
  httpGet:
    path: /actuator/health/liveness
    port: 8080
  initialDelaySeconds: 120   # wait for app startup (2 min)
  periodSeconds: 10          # check interval
  timeoutSeconds: 5          # probe timeout
  failureThreshold: 6        # allowed consecutive failures

readinessProbe:
  httpGet:
    path: /actuator/health/readiness
    port: 8080
  initialDelaySeconds: 60    # shorter than liveness
  periodSeconds: 5
  timeoutSeconds: 3
  failureThreshold: 6
```
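An alternative to a large initialDelaySeconds is a Kubernetes startupProbe: while it is configured, the kubelet disables the liveness and readiness probes until the startup probe succeeds once. A sketch using the same endpoint (the thresholds here are illustrative, not from this deployment):

```yaml
startupProbe:
  httpGet:
    path: /actuator/health/liveness
    port: 8080
  periodSeconds: 10
  failureThreshold: 18   # allows up to 180 s of startup before the container is restarted
# With this in place, liveness/readiness initialDelaySeconds can stay small,
# since neither probe runs until the startupProbe has succeeded.
```

This keeps post-startup failure detection fast (no 120 s blind spot after a restart) while still tolerating slow cold starts.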
3. Per-Service Settings#
| Service | Port | Context Path | Liveness Delay | Readiness Delay |
|---|---|---|---|---|
| api-gateway | 8085 | (none) | 120s | 60s |
| auth-guard | 8080 | /auth | 120s | 60s |
| queue | 8081 | /queue | 120s | 60s |
| seat | 8082 | /seat | 120s | 60s |
| order-core | 8083 | /order | 120s | 60s |
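Because Helm deep-merges values, each staging values file only needs to override what differs from the chart defaults. A sketch for one row of the table (the file name appears in this document's modified-files list; the key nesting is an assumption about this chart):

```yaml
# staging/values/apps/values-auth-guard.yaml -- sketch
contextPath: /auth

livenessProbe:
  httpGet:
    port: 8080
  initialDelaySeconds: 120

readinessProbe:
  httpGet:
    port: 8080
  initialDelaySeconds: 60
```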
Probe Configuration Guide#
How to determine initialDelaySeconds#
- Measure app startup time: Find “Started Application in X seconds” in the logs
- Add buffer: measured value + 30–50%
- Account for environment: production should have 10–20% more headroom than staging
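As a worked example of the buffer rule (a throwaway shell calculation; 90 s is the upper end of the observed staging startup time, and the variable names are mine):

```shell
# Suggest an initialDelaySeconds: measured startup time + 30% buffer.
measured=90                                    # seconds, from "Started ... in 90.1 seconds"
suggested=$(( measured + measured * 30 / 100 ))
echo "initialDelaySeconds: at least ${suggested}s"   # 90 + 27 = 117 -> rounded to 120 here
```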
failureThreshold setting#
- failureThreshold × periodSeconds = total allowed failure time
- Example: failureThreshold: 6, periodSeconds: 10 → 60 additional seconds of tolerance
liveness vs readiness#
| | liveness | readiness |
|---|---|---|
| Purpose | Check if the app has died | Check if the app is ready for traffic |
| On failure | Restart the container | Remove from service |
| Initial delay | Long (wait for full startup) | Shorter (serve traffic sooner) |
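On the Spring Boot side, the /actuator/health/liveness and /actuator/health/readiness endpoints are backed by the built-in liveness and readiness health groups. A minimal application.yml sketch to expose them (standard Spring Boot Actuator properties; recent Spring Boot versions also enable the probes automatically when they detect they are running on Kubernetes):

```yaml
management:
  endpoint:
    health:
      probes:
        enabled: true   # expose /actuator/health/{liveness,readiness} even outside Kubernetes
  endpoints:
    web:
      exposure:
        include: health
```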
Modified Files#
303-goormgb-k8s-helm#
- common-charts/apps/java-service/values.yaml: added contextPath
- common-charts/apps/java-service/templates/deployment.yaml: applied contextPath
- staging/values/apps/values-api-gateway.yaml: probe settings
- staging/values/apps/values-auth-guard.yaml: contextPath + probe
- staging/values/apps/values-queue.yaml: contextPath + probe
- staging/values/apps/values-seat.yaml: contextPath + probe
- staging/values/apps/values-order-core.yaml: contextPath + probe
Verification Commands#
```bash
# Watch Pod status
kubectl get pods -n staging-webs -w

# Check Pod events (probe failure logs)
kubectl describe pod <pod-name> -n staging-webs

# Check app startup logs
kubectl logs <pod-name> -n staging-webs | grep -i "started"

# Test probe directly (from inside the Pod)
kubectl exec -it <pod-name> -n staging-webs -- curl localhost:8080/auth/actuator/health/liveness
```
Lessons Learned#
- Java is slow to start: Always account for cold start time generously
- Check the context path: Always verify server.servlet.context-path in Spring Boot configuration
- Environment differences matter: Consider infrastructure differences between Dev and Staging/Prod
- Tune based on logs: Measure actual startup time before setting probe values
Written: 2026-03-18
Related: Staging EKS deployment troubleshooting