Java Spring Boot Probe Troubleshooting

Problem

Java Spring Boot apps in the Staging EKS environment kept restarting and ended up in CrashLoopBackOff.

Symptoms

Pods receive SIGTERM (exit code 143) within 60 seconds of starting
kubelet kills and restarts the container after repeated liveness probe failures
Probe responses: 404 errors initially, then 503/500 errors after the context path fix

Root Cause Analysis

1. Missing Context Path (Primary Cause)

Each Spring Boot app is served under its own context path (server.servlet.context-path):

auth-guard: /auth
queue: /queue
seat: /seat
order-core: /order

But the probes were configured without the context path:

livenessProbe:
  httpGet:
    path: /actuator/health/liveness  # returns 404: context path is missing

The correct paths include the context path:

/auth/actuator/health/liveness
/queue/actuator/health/liveness
...
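
For reference, a minimal sketch of the Spring Boot side that produces these paths. The file is illustrative and not copied from the services. It assumes Actuator is served on the same port as the application (no separate management.server.port), in which case the context path applies to the Actuator endpoints as well.

```yaml
# application.yml (illustrative; the real service configuration is not shown in this note)
server:
  port: 8080
  servlet:
    context-path: /auth   # all servlet endpoints, Actuator included, move under /auth
management:
  endpoint:
    health:
      probes:
        enabled: true     # exposes /actuator/health/liveness and /actuator/health/readiness
# Resulting liveness URL: http://<pod-ip>:8080/auth/actuator/health/liveness
```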

2. Insufficient initialDelaySeconds (Secondary Cause)

The restarts persisted after the context path was fixed. The remaining cause was Java app startup time: the liveness probe began checking before the app had finished starting.

Original configuration:

livenessProbe:
  initialDelaySeconds: 60  # too short
readinessProbe:
  initialDelaySeconds: 30

Java Spring Boot startup times:

Simple app: 10–30 seconds
Medium-sized app: 30–60 seconds
Complex app (DB connections, caching, etc.): 60–120+ seconds

Observed startup time in Staging: 50–90 seconds

Dev vs Staging Environment Differences

Dev Environment (Mini PC)

CPU: High-performance desktop CPU (Intel/AMD)
Memory: Ample RAM
Disk: NVMe SSD
JIT compilation: Fast

Staging Environment (EKS t4g.large)

CPU: AWS Graviton2 (ARM64), 2 vCPU
Memory: 8 GB (shared across multiple Pods)
Spot instances: possible resource contention
JIT compilation: Can be slightly slower on ARM

Conclusion: t4g.large is not underpowered, but Java cold starts can be 10–20% slower on ARM than on x86. On top of that, several Pods starting at the same time compete for the node's 2 vCPUs and shared memory, which slows startup further.

Fix

1. Helm Chart Update (java-service)

Added contextPath support in deployment.yaml so that the probe path is prefixed with each service's context path:

livenessProbe:
  httpGet:
    path: {{ .Values.contextPath }}{{ .Values.livenessProbe.httpGet.path }}
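
As a sketch of how this renders, assume a values excerpt for auth-guard like the first document below. Only contextPath and livenessProbe.httpGet.path are confirmed by the template line above. The excerpt itself is illustrative and not copied from the repository.

```yaml
# Input: values for auth-guard (illustrative excerpt)
contextPath: /auth
livenessProbe:
  httpGet:
    path: /actuator/health/liveness
---
# Output: the probe path after template rendering
livenessProbe:
  httpGet:
    path: /auth/actuator/health/liveness
```

Keeping the context path in its own value means the probe path in each values file stays identical across services, and only contextPath differs.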

2. Optimized Probe Settings

Recommended settings (Java Spring Boot):

livenessProbe:
  httpGet:
    path: /actuator/health/liveness
    port: 8080
  initialDelaySeconds: 120    # wait for app startup (2 min)
  periodSeconds: 10           # probe every 10 seconds
  timeoutSeconds: 5           # each probe times out after 5 seconds
  failureThreshold: 6         # restart after 6 consecutive failures

readinessProbe:
  httpGet:
    path: /actuator/health/readiness
    port: 8080
  initialDelaySeconds: 60     # shorter than liveness
  periodSeconds: 5
  timeoutSeconds: 3
  failureThreshold: 6         # mark not ready after 6 consecutive failures

3. Per-Service Settings

Service	Port	Context Path	Liveness Delay	Readiness Delay
api-gateway	8085	(none)	120s	60s
auth-guard	8080	/auth	120s	60s
queue	8081	/queue	120s	60s
seat	8082	/seat	120s	60s
order-core	8083	/order	120s	60s
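
As an illustration of how one row of this table might map to a per-service values file, here is a sketch for api-gateway, the only service without a context path. The key layout beyond contextPath and livenessProbe.httpGet.path (for example, the readinessProbe block) is an assumption about the chart, not taken from the actual files.

```yaml
# values-api-gateway.yaml (illustrative excerpt; layout partly assumed)
contextPath: ""                # api-gateway has no context path, so the probe path is unchanged
livenessProbe:
  httpGet:
    path: /actuator/health/liveness
    port: 8085
  initialDelaySeconds: 120
readinessProbe:
  httpGet:
    path: /actuator/health/readiness
    port: 8085
  initialDelaySeconds: 60
```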

Probe Configuration Guide

How to determine initialDelaySeconds

Measure app startup time: find the “Started Application in X seconds” line in the logs
Add a buffer: measured value + 30–50%
Account for the environment: production should have 10–20% more headroom than staging
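
Applying these steps to the numbers in this note, as a rough worked example: the slowest startup observed in Staging was 90 seconds, so a 30–50% buffer gives a range of about 117 to 135 seconds, and the chosen 120 seconds falls inside it. The production line reads "10–20% more headroom" as a multiplier on the staging value, which is one possible interpretation.

```yaml
livenessProbe:
  # Step 1: slowest observed startup in Staging = 90s
  # Step 2: 90s * 1.3 = 117s, 90s * 1.5 = 135s
  # Result: 120s falls inside the 117s to 135s range
  # Step 3 (production, assumed reading): 120s * 1.1 = 132s up to 120s * 1.2 = 144s
  initialDelaySeconds: 120
```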

failureThreshold setting

failureThreshold * periodSeconds = time a failing container is tolerated once probing starts
Example: failureThreshold: 6, periodSeconds: 10 = about 60 additional seconds of tolerance on top of initialDelaySeconds
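
With the recommended liveness settings from this note, the budget works out roughly as follows. The timings are approximate: kubelet does not fire probes at exact offsets, and each failing probe can also take up to timeoutSeconds to return.

```yaml
livenessProbe:
  initialDelaySeconds: 120   # probing starts about 120s after container start
  periodSeconds: 10
  failureThreshold: 6
  # Tolerance once probing starts: 6 * 10s = 60s
  # A container that never becomes healthy is restarted at around 120s + 60s = 180s
```

Compared with the 50–90 second startup observed in Staging, this leaves roughly 90 seconds of margin before a restart.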

liveness vs readiness

Aspect	liveness	readiness
Purpose	Check if the app has died	Check if the app is ready for traffic
On failure	Restart the container	Remove from service
Initial delay	Long (wait for full startup)	Shorter (serve traffic sooner)

Modified Files

303-goormgb-k8s-helm

common-charts/apps/java-service/values.yaml - added contextPath
common-charts/apps/java-service/templates/deployment.yaml - applied contextPath
staging/values/apps/values-api-gateway.yaml - probe settings
staging/values/apps/values-auth-guard.yaml - contextPath + probe
staging/values/apps/values-queue.yaml - contextPath + probe
staging/values/apps/values-seat.yaml - contextPath + probe
staging/values/apps/values-order-core.yaml - contextPath + probe

Verification Commands

# Watch Pod status
kubectl get pods -n staging-webs -w

# Check Pod events (probe failure logs)
kubectl describe pod <pod-name> -n staging-webs

# Check app startup logs
kubectl logs <pod-name> -n staging-webs | grep -i "started"

# Test probe directly (from inside the Pod)
kubectl exec -it <pod-name> -n staging-webs -- curl localhost:8080/auth/actuator/health/liveness

Lessons Learned

Java is slow to start: Always budget generously for cold start time
Check the context path: Always verify server.servlet.context-path in Spring Boot configuration
Environment differences matter: Consider infrastructure differences between Dev and Staging/Prod
Tune based on logs: Measure actual startup time before setting probe values

Written: 2026-03-18
Related: Staging EKS deployment troubleshooting