Problem#

Java Spring Boot apps in the Staging EKS environment kept restarting in CrashLoopBackOff.

Symptoms#

  • Pods receive SIGTERM (exit code 143) within 60 seconds of starting
  • kubelet restarts the container after repeated liveness probe failures
  • Probe responses: 404 errors at first, then 503/500 errors after the partial fix (context path only)

Root Cause Analysis#

1. Missing Context Path (Primary Cause)#

Each Spring Boot app is served under its own context path (server.servlet.context-path):

  • auth-guard: /auth
  • queue: /queue
  • seat: /seat
  • order-core: /order

But the probes were configured without the context path:

livenessProbe:
  httpGet:
    path: /actuator/health/liveness  # returns 404: context path is missing

The actual paths should be:

/auth/actuator/health/liveness
/queue/actuator/health/liveness
...

2. Insufficient initialDelaySeconds (Secondary Cause)#

The restarts continued after the context path was fixed. The remaining cause was Java app startup time: the liveness probe started failing before the app had finished starting.

Original configuration:

livenessProbe:
  initialDelaySeconds: 60  # too short: observed startup in Staging is 50–90 seconds
readinessProbe:
  initialDelaySeconds: 30

Typical Java Spring Boot startup times (rough guide):

  • Simple app: 10–30 seconds
  • Medium-sized app: 30–60 seconds
  • Complex app (DB connections, caching, etc.): 60–120+ seconds

Observed startup time in Staging: 50–90 seconds
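
The observed range can be read straight from the application logs. A minimal sketch, assuming the usual Spring Boot startup line ("Started <MainClass> in <N> seconds ..."); the helper name startup_seconds and the sample log line are made up for illustration:

```shell
# Hypothetical helper: read log lines on stdin and print the startup time
# (in seconds) from any "Started <MainClass> in <N> seconds" line.
startup_seconds() {
  sed -n 's/.*Started [^ ]* in \([0-9.]*\) seconds.*/\1/p'
}

# Sample log line (invented for this example, not taken from the cluster).
echo 'INFO 1 --- [main] c.e.AuthGuardApplication : Started AuthGuardApplication in 52.317 seconds (process running for 53.9)' \
  | startup_seconds
```

In the cluster, the input would come from kubectl logs <pod-name> -n staging-webs instead of echo.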

Dev vs Staging Environment Differences#

Dev Environment (Mini PC)#

  • CPU: High-performance desktop CPU (Intel/AMD)
  • Memory: Ample RAM
  • Disk: NVMe SSD
  • JIT compilation: Fast

Staging Environment (EKS t4g.large)#

  • CPU: AWS Graviton2 (ARM64) 2 vCPU
  • Memory: 8GB (shared across multiple Pods)
  • Spot instances: possible resource contention
  • JIT compilation: Can be slightly slower on ARM

Conclusion: t4g.large is a reasonable spec, but Java cold start can be 10–20% slower on ARM than on x86. On top of that, several Pods starting at the same time compete for the node's 2 vCPUs, which slows startup further.

Fix#

1. Helm Chart Update (java-service)#

Added contextPath support in deployment.yaml:

livenessProbe:
  httpGet:
    path: {{ .Values.contextPath }}{{ .Values.livenessProbe.httpGet.path }}
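
The template joins the two values as plain strings, so contextPath should start with a slash and have no trailing slash (a trailing slash would produce a double slash), and an empty contextPath leaves the probe path unchanged, as for api-gateway. A small shell sketch of the same concatenation; probe_path is a hypothetical name and this is not Helm itself:

```shell
# Hypothetical helper mirroring the template's plain string concatenation.
# Usage: probe_path <contextPath> <probe path>
probe_path() {
  printf '%s%s\n' "$1" "$2"
}

probe_path "/auth" "/actuator/health/liveness"   # auth-guard
probe_path ""      "/actuator/health/liveness"   # api-gateway (no context path)
```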

2. Optimized Probe Settings#

Recommended settings (Java Spring Boot). The paths omit the context path because the chart template prepends contextPath:

livenessProbe:
  httpGet:
    path: /actuator/health/liveness
    port: 8080
  initialDelaySeconds: 120    # wait for app startup (2 min)
  periodSeconds: 10           # probe every 10 seconds
  timeoutSeconds: 5           # a probe fails if there is no response within 5 seconds
  failureThreshold: 6         # consecutive failures before the container is restarted

readinessProbe:
  httpGet:
    path: /actuator/health/readiness
    port: 8080
  initialDelaySeconds: 60     # shorter than liveness
  periodSeconds: 5
  timeoutSeconds: 3
  failureThreshold: 6

3. Per-Service Settings#

Service      Port  Context Path  Liveness Delay  Readiness Delay
api-gateway  8085  (none)        120s            60s
auth-guard   8080  /auth         120s            60s
queue        8081  /queue        120s            60s
seat         8082  /seat         120s            60s
order-core   8083  /order        120s            60s

Probe Configuration Guide#

How to determine initialDelaySeconds#

  1. Measure app startup time: Find “Started Application in X seconds” in the logs
  2. Add buffer: measured value + 30–50%
  3. Account for environment: production should have 10–20% more headroom than staging
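
The first two steps are simple arithmetic. A minimal sketch using the slowest startup observed in Staging (90 seconds) with the 30% and 50% buffers; initial_delay is a hypothetical helper and works with whole seconds only:

```shell
# Hypothetical helper: measured startup time plus a percentage buffer,
# rounded up to a whole second (integer arithmetic only).
# Usage: initial_delay <measured_seconds> <buffer_percent>
initial_delay() {
  measured="$1"
  buffer_pct="$2"
  echo $(( (measured * (100 + buffer_pct) + 99) / 100 ))
}

initial_delay 90 30   # slowest observed Staging startup, +30%
initial_delay 90 50   # same measurement, +50%
```

For a 90-second startup this gives a range of 117 to 135 seconds, which is where the 120-second liveness delay sits.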

failureThreshold setting#

  • failureThreshold × periodSeconds = how long probes may keep failing before Kubernetes acts
  • Example: failureThreshold: 6 with periodSeconds: 10 gives about 60 seconds of tolerance on top of initialDelaySeconds
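
Combined with initialDelaySeconds, the rule above gives a rough deadline by which the app must pass its first liveness check. A sketch under that approximation (it ignores timeoutSeconds and probe scheduling details); restart_deadline is a hypothetical helper:

```shell
# Hypothetical helper: approximate seconds from container start until the
# liveness probe has used up its allowed failures.
# Usage: restart_deadline <initialDelaySeconds> <periodSeconds> <failureThreshold>
restart_deadline() {
  echo $(( $1 + $2 * $3 ))
}

restart_deadline 120 10 6   # recommended liveness settings
```

With the recommended settings the result is 180 seconds, about twice the slowest startup observed in Staging.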

liveness vs readiness#

               liveness                      readiness
Purpose        Check if the app has died     Check if the app is ready for traffic
On failure     Restart the container         Remove from service
Initial delay  Long (wait for full startup)  Shorter (serve traffic sooner)

Modified Files#

303-goormgb-k8s-helm#

  • common-charts/apps/java-service/values.yaml - added contextPath
  • common-charts/apps/java-service/templates/deployment.yaml - applied contextPath
  • staging/values/apps/values-api-gateway.yaml - probe settings
  • staging/values/apps/values-auth-guard.yaml - contextPath + probe
  • staging/values/apps/values-queue.yaml - contextPath + probe
  • staging/values/apps/values-seat.yaml - contextPath + probe
  • staging/values/apps/values-order-core.yaml - contextPath + probe

Verification Commands#

# Watch Pod status (restart count, CrashLoopBackOff)
kubectl get pods -n staging-webs -w

# Check Pod events (probe failure logs)
kubectl describe pod <pod-name> -n staging-webs

# Check app startup logs
kubectl logs <pod-name> -n staging-webs | grep -i "started"

# Test probe directly (from inside the Pod)
kubectl exec -it <pod-name> -n staging-webs -- curl localhost:8080/auth/actuator/health/liveness

Lessons Learned#

  1. Java is slow to start: Always account for cold start time generously
  2. Check the context path: Always verify server.servlet.context-path in Spring Boot configuration
  3. Environment differences matter: Consider infrastructure differences between Dev and Staging/Prod
  4. Tune based on logs: Measure actual startup time before setting probe values

Written: 2026-03-18
Related: Staging EKS deployment troubleshooting