Problem

  • 1,253 argocd-repo-server pods were created on the mini-gmk node (all in the Evicted state)
  • DiskPressure: True
  • Several services in CrashLoopBackOff (ai-defense, seat, and others)
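
One way to confirm these symptoms is to tally pods by status. The following is a minimal sketch, not the exact commands used during the incident; the helper name count_by_status is hypothetical, and it assumes the default column layout of `kubectl get pods -A --no-headers` (NAMESPACE NAME READY STATUS ...).

```shell
#!/usr/bin/env bash
# Hypothetical helper: tally pods by STATUS.
# Reads `kubectl get pods -A --no-headers` output on stdin, where the
# fourth whitespace-separated column is STATUS.
count_by_status() {
  awk '{ n[$4]++ } END { for (s in n) print s, n[s] }' | sort
}

# Usage against a live cluster (not executed here):
#   kubectl get pods -A --no-headers | count_by_status
#   kubectl describe node mini-gmk | grep -i DiskPressure
```

A large Evicted count alongside DiskPressure on the node would match the situation described above.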

Root Cause Analysis

1. DiskPressure (mini-gmk)

| Path | Size | Cause |
|---|---|---|
| /opt/local-path-provisioner | 53G | Loki chunks 40G + orphan PVs |
| Loki retention | none | logs accumulate indefinitely |
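
To see which directories account for the usage in the table, a sketch like the following can help. The helper name largest_dirs is hypothetical, and it assumes GNU du (for --max-depth).

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the N largest immediate subdirectories of DIR,
# biggest first. Sizes are in KiB; output is "size<TAB>path".
largest_dirs() {
  local dir="$1" n="${2:-10}"
  du -k --max-depth=1 "$dir" 2>/dev/null \
    | sort -rn \
    | awk -F'\t' -v d="$dir" '$2 != d' \
    | head -n "$n"
}

# Usage on the node (not executed here; may need sudo to read every PV directory):
#   largest_dirs /opt/local-path-provisioner 10
```

The awk filter drops the line for DIR itself, so only its children are ranked.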

2. CrashLoopBackOff

| Service | Cause |
|---|---|
| staging ai-defense | imagePullSecrets: [] missing (the default, ecr-pull-secret, was used) |
| dev ai-defense | TM_OFFLINE_LLM_AUDIT_PATH missing (PermissionError: logs/) |
| dev seat | SPRING_KAFKA_BOOTSTRAP_SERVERS missing |

Resolution

Resolving DiskPressure

# 1. Delete orphan PV directories (frees ~8G)
sudo rm -rf /opt/local-path-provisioner/pvc-*_data_cloudbeaver-data
sudo rm -rf /opt/local-path-provisioner/pvc-*_monitoring_data-prometheus-*
# (only PVs that are no longer in use)

# 2. Delete Loki chunks (frees ~40G)
# Remove chunk files older than one day
sudo find /opt/local-path-provisioner/pvc-*_monitoring_storage-loki-0/chunks -type f -mtime +1 -delete
# Remove everything still under chunks/ (this discards all stored logs)
sudo rm -rf /opt/local-path-provisioner/pvc-*_monitoring_storage-loki-0/chunks/*

# 3. Clean up system logs
sudo journalctl --vacuum-size=200M

# 4. Delete leftover Evicted pod records (Evicted pods are in phase Failed), plus completed pods
kubectl delete pods -A --field-selector=status.phase=Failed
kubectl delete pods -A --field-selector=status.phase=Succeeded
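
Before running rm -rf on PV directories, it may be worth checking which ones are actually orphaned. The sketch below compares directory names against the PVs the cluster still knows about. It assumes the local-path-provisioner directory naming seen in the paths above, <pv-name>_<namespace>_<pvc-name>; the helper name find_orphans and the /tmp file names are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print directory names whose PV-name prefix is not
# present in the list of existing PVs.
#   $1: file with one directory name per line
#   $2: file with one PV name per line
find_orphans() {
  local dirs_file="$1" pvs_file="$2" d pv
  while IFS= read -r d; do
    pv="${d%%_*}"   # everything before the first underscore is the PV name
    grep -qxF -- "$pv" "$pvs_file" || printf '%s\n' "$d"
  done < "$dirs_file"
}

# Usage on the node (not executed here):
#   ls /opt/local-path-provisioner > /tmp/dirs.txt
#   kubectl get pv -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}' > /tmp/pvs.txt
#   find_orphans /tmp/dirs.txt /tmp/pvs.txt
```

The function only prints candidates; deleting them stays a separate, manual step.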

Helm Values Changes

1. Add Loki retention (dev/staging)

# dev/values/monitoring/values-loki.yaml
loki:
  compactor:
    retention_enabled: true
    delete_request_store: filesystem
  limits_config:
    retention_period: 72h  # 3 days
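
After the values are deployed, the effective settings can be checked against the running Loki instance, which serves its current configuration on the /config HTTP endpoint. This is a sketch under assumptions: the pod name loki-0 and namespace monitoring are inferred from the PVC directory name (monitoring_storage-loki-0), port 3100 is Loki's default HTTP port, and the helper name retention_settings is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: extract retention-related keys from a Loki config
# dump (YAML on stdin), stripping leading indentation.
retention_settings() {
  grep -E '^[[:space:]]*(retention_enabled|retention_period|delete_request_store):' \
    | sed 's/^[[:space:]]*//'
}

# Usage against a live cluster (not executed here; names are assumptions):
#   kubectl -n monitoring port-forward pod/loki-0 3100:3100 &
#   curl -s http://localhost:3100/config | retention_settings
```

If retention_enabled does not show as true, the compactor is not applying retention_period.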

2. Add revisionHistoryLimit (common charts)

# common-charts/apps/java-service/templates/deployment.yaml
# common-charts/apps/ai-service/templates/deployment.yaml
spec:
  revisionHistoryLimit: 3
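
revisionHistoryLimit caps how many old ReplicaSets a Deployment keeps for rollback. To see how many scaled-to-zero ReplicaSets are retained per namespace, a sketch like this can be used. The helper name count_idle_replicasets is hypothetical, and it assumes the default column layout of `kubectl get rs -A --no-headers` (NAMESPACE NAME DESIRED CURRENT READY AGE).

```shell
#!/usr/bin/env bash
# Hypothetical helper: count ReplicaSets with DESIRED == 0 per namespace.
# Reads `kubectl get rs -A --no-headers` output on stdin.
count_idle_replicasets() {
  awk '$3 == 0 { n[$1]++ } END { for (ns in n) print ns, n[ns] }' | sort
}

# Usage against a live cluster (not executed here):
#   kubectl get rs -A --no-headers | count_idle_replicasets
```

With the limit set to 3, each Deployment should contribute at most three idle ReplicaSets once its older ones are garbage-collected.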

3. staging ai-defense imagePullSecrets

# staging/values/apps/values-ai-defense.yaml
imagePullSecrets: []  # disabled because staging uses IRSA

4. dev ai-defense environment variable

# dev/values/apps/values-ai-defense.yaml
env:
  - name: TM_OFFLINE_LLM_AUDIT_PATH
    value: "/tmp/logs/offline_llm_audit.jsonl"

5. dev seat Kafka settings

# dev/values/apps/values-seat.yaml
env:
  - name: SPRING_KAFKA_BOOTSTRAP_SERVERS
    value: "kafka.messaging.svc.cluster.local:9092"

Commit

feat(sprint5/troubleshooting): resolve disk pressure and CrashLoopBackOff issues

- common-charts: add revisionHistoryLimit: 3
- dev/staging loki: add compactor for retention
- staging ai-defense: add imagePullSecrets: []
- dev ai-defense: add TM_OFFLINE_LLM_AUDIT_PATH
- dev seat: add Kafka bootstrap servers

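Lessons Learned

  1. Loki retention is mandatory - without the compactor, retention_period has no effect
  2. local-path-provisioner orphan PVs - deleting a PVC does not automatically delete its directory
  3. staging uses IRSA - imagePullSecrets: [] must be set explicitly (to override the default)
  4. revisionHistoryLimit - the default of 10 is too many, so limit it to 3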