Java Spring Boot Probe Troubleshooting

Problem

In the staging EKS environment, the Java Spring Boot apps were stuck in CrashLoopBackOff and kept restarting.

Symptoms

Pods received SIGTERM (exit code 143) within 60 seconds of starting
The kubelet force-killed the container after liveness probe failures
404 errors (initially) → 503/500 errors (after the context path fix)

Root Cause Analysis

1. Missing context path (primary cause)

Each Spring Boot app is served under its own context path:

auth-guard: /auth
queue: /queue
seat: /seat
order-core: /order

But the probes were configured as:

livenessProbe:
  httpGet:
    path: /actuator/health/liveness  # returns 404

The actual paths are:

/auth/actuator/health/liveness
/queue/actuator/health/liveness
...
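
The mismatch can be sketched for auth-guard as follows. The application.yml contents are an assumption (this note only names the property server.servlet.context-path and the /auth value), and the sketch assumes actuator is served on the same port as the application.

```yaml
# Assumed application.yml for auth-guard (not shown in this note)
server:
  servlet:
    context-path: /auth   # servlet endpoints, actuator included, move under /auth
---
# Container spec fragment: the probe path the kubelet has to request
livenessProbe:
  httpGet:
    path: /auth/actuator/health/liveness
    port: 8080
```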

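2. initialDelaySeconds too short (secondary cause)

The problem persisted after the context path fix. The cause was the Java apps' startup time: liveness checks began before the slower Pods had finished starting.

Previous settings:

livenessProbe:
  initialDelaySeconds: 60  # too short
readinessProbe:
  initialDelaySeconds: 30

Typical Java Spring Boot startup times:

Simple app: 10-30 seconds
Medium-sized app: 30-60 seconds
Complex app (DB connections, caches, etc.): 60-120+ seconds

Startup time actually observed in staging: 50-90 seconds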

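Dev vs. Staging

Dev environment (mini PC)

CPU: high-performance desktop CPU (Intel/AMD)
Memory: ample RAM
Disk: NVMe SSD
JIT compilation: fast

Staging environment (EKS t4g.large)

CPU: AWS Graviton2 (ARM64), 2 vCPU
Memory: 8GB (shared by multiple Pods)
Spot instances: resource contention possible
JIT compilation: can be slightly slower on ARM

Conclusion: t4g.large is not a weak instance type, but Java cold start can be 10-20% slower on ARM than on x86. When several Pods start at the same time, resource contention slows startup further.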

Fix

1. Helm chart change (java-service)

Added contextPath support in deployment.yaml:

livenessProbe:
  httpGet:
    path: {{ .Values.contextPath }}{{ .Values.livenessProbe.httpGet.path }}
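
A minimal sketch of values that would feed this template. The key names come from the template expression above; the file contents and the empty chart default are assumptions.

```yaml
# Service override (sketch), e.g. for auth-guard
contextPath: /auth            # chart default assumed to be "" (no prefix)
livenessProbe:
  httpGet:
    path: /actuator/health/liveness   # written without the context path

# Path rendered into the Deployment:
#   /auth/actuator/health/liveness
```

With an empty contextPath, as for a service with no context path, the rendered path is the configured path unchanged.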

2. Probe tuning

Recommended settings (Java Spring Boot):

livenessProbe:
  httpGet:
    path: /actuator/health/liveness
    port: 8080
  initialDelaySeconds: 120    # wait for app startup (2 minutes)
  periodSeconds: 10           # check interval
  timeoutSeconds: 5           # per-check timeout
  failureThreshold: 6         # consecutive failures allowed

readinessProbe:
  httpGet:
    path: /actuator/health/readiness
    port: 8080
  initialDelaySeconds: 60     # shorter than liveness
  periodSeconds: 5
  timeoutSeconds: 3
  failureThreshold: 6

3. Per-service settings

Service	Port	Context Path	liveness delay	readiness delay
api-gateway	8085	(none)	120s	60s
auth-guard	8080	/auth	120s	60s
queue	8081	/queue	120s	60s
seat	8082	/seat	120s	60s
order-core	8083	/order	120s	60s
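
As an illustration, the queue row combined with the recommended settings might look like this in a per-service values file. The key layout is an assumption based on the template expression shown earlier, and the sketch assumes the chart prefixes the readiness path the same way as the liveness path.

```yaml
# Sketch of staging/values/apps/values-queue.yaml (layout assumed)
contextPath: /queue
livenessProbe:
  httpGet:
    path: /actuator/health/liveness    # chart prepends contextPath
    port: 8081
  initialDelaySeconds: 120
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 6
readinessProbe:
  httpGet:
    path: /actuator/health/readiness
    port: 8081
  initialDelaySeconds: 60
  periodSeconds: 5
  timeoutSeconds: 3
  failureThreshold: 6
```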

Probe Configuration Guide

Choosing initialDelaySeconds

Measure startup time: look for "Started Application in X seconds" in the logs
Add a buffer: measured value + 30-50%
Account for the environment: give production 10-20% more headroom than staging
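
Applied to the staging numbers in this note, the rule works out roughly as follows. Using the slowest observed startup as the measured value is an assumption of this sketch.

```yaml
# Slowest startup observed in staging: 90s
#   90s + 30% buffer = 117s
#   90s + 50% buffer = 135s
livenessProbe:
  initialDelaySeconds: 120   # falls inside the 117-135s range
```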

Setting failureThreshold

failureThreshold * periodSeconds = total time tolerated
Example: failureThreshold: 6 with periodSeconds: 10 gives 60 additional seconds of tolerance
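
Combined with the initial delay, the recommended liveness settings give an approximate startup budget. The figures are a rough estimate; exact timing depends on when the kubelet schedules each check and on timeoutSeconds.

```yaml
livenessProbe:
  initialDelaySeconds: 120   # no liveness checks before this point
  periodSeconds: 10
  failureThreshold: 6        # 6 x 10s = 60s of consecutive failures tolerated
# Approximate budget before a restart: 120s + 60s = 180s
# Startup observed in staging:         50-90s
```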

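liveness vs. readiness

Aspect	liveness	readiness
Purpose	Checks whether the app is dead	Checks whether the app is ready for traffic
On failure	Container is restarted	Pod is removed from the Service endpoints
Initial delay	Long (app fully started)	Short (enter service quickly)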
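
Both endpoints are Spring Boot Actuator health groups. Spring Boot enables them automatically when it detects a Kubernetes environment, and the property below turns them on explicitly. Whether these services set it is not stated in this note, so treat the fragment as a sketch.

```yaml
# application.yml (sketch, not taken from the services in this note)
management:
  endpoint:
    health:
      probes:
        enabled: true   # exposes /actuator/health/liveness and /actuator/health/readiness
```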

Modified Files

303-goormgb-k8s-helm

common-charts/apps/java-service/values.yaml - added contextPath
common-charts/apps/java-service/templates/deployment.yaml - applied contextPath
staging/values/apps/values-api-gateway.yaml - probe settings
staging/values/apps/values-auth-guard.yaml - contextPath + probe
staging/values/apps/values-queue.yaml - contextPath + probe
staging/values/apps/values-seat.yaml - contextPath + probe
staging/values/apps/values-order-core.yaml - contextPath + probe

Verification Commands

# Watch Pod status
kubectl get pods -n staging-webs -w

# Check Pod events (probe failure log)
kubectl describe pod <pod-name> -n staging-webs

# Check the app startup log
kubectl logs <pod-name> -n staging-webs | grep -i "started"

# Test the probe directly (from inside the Pod)
kubectl exec -it <pod-name> -n staging-webs -- curl localhost:8080/auth/actuator/health/liveness

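Lessons Learned

Java starts slowly: budget enough time for cold start
Check the context path: always verify Spring Boot's server.servlet.context-path setting
Environment differences: account for infrastructure differences between dev and staging/prod
Log-based tuning: measure the actual startup time, then adjust the probe settings

Written: 2026-03-18. Related: Staging EKS deployment troubleshooting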