Problem
- 1,253 argocd-repo-server pods had been created on the mini-gmk node (Evicted state)
- DiskPressure: True
- Several services were in CrashLoopBackOff (ai-defense, seat, etc.)
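To confirm symptoms like these, a rough per-namespace count of Evicted pods can help. The snippet below is a minimal sketch: `count_evicted` is a hypothetical helper name, and it assumes the default column layout of `kubectl get pods -A --no-headers` (NAMESPACE, NAME, READY, STATUS, ...).

```shell
#!/usr/bin/env bash
# count_evicted (hypothetical helper): read `kubectl get pods -A --no-headers`
# output on stdin and print "<namespace> <count>" for every namespace that
# has Evicted pods. Assumes STATUS is the 4th column (default output with -A).
count_evicted() {
  awk '$4 == "Evicted" { n[$1]++ } END { for (ns in n) print ns, n[ns] }' | sort
}

# Usage (requires cluster access):
#   kubectl get pods -A --no-headers | count_evicted
#   kubectl describe node mini-gmk | grep DiskPressure
```

The function only parses text, so it can be tried against saved `kubectl` output before running it on a live cluster.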
Root cause analysis
1. DiskPressure (mini-gmk)
| Path / setting | Size / value | Cause |
|---|---|---|
| /opt/local-path-provisioner | 53G | Loki chunks (40G) + orphan PVs |
| Loki retention | none | Logs accumulate without limit |
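Figures like the ones in this table can be gathered with `du`. The following is a minimal sketch, not the exact commands used in the incident; `top_dirs` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# top_dirs (hypothetical helper): list the largest immediate subdirectories
# of a directory, largest first, with sizes in KiB.
# Usage: top_dirs <dir> [count]
top_dirs() {
  local dir="$1" count="${2:-10}"
  du -sk -- "$dir"/*/ 2>/dev/null | sort -rn | head -n "$count"
}

# Usage (on the node, as a user that can read the PV directories):
#   top_dirs /opt/local-path-provisioner 10
```

Sorting numerically on KiB avoids relying on human-readable size suffixes, which not every `sort` implementation can order.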
2. CrashLoopBackOff
| Service | Cause |
|---|---|
| staging ai-defense | imagePullSecrets: [] was missing (the default ecr-pull-secret was used) |
| dev ai-defense | TM_OFFLINE_LLM_AUDIT_PATH was missing (PermissionError: logs/) |
| dev seat | SPRING_KAFKA_BOOTSTRAP_SERVERS was missing |
Resolving DiskPressure
# 1. Delete orphan PVs (frees ~8G)
sudo rm -rf /opt/local-path-provisioner/pvc-*_data_cloudbeaver-data
sudo rm -rf /opt/local-path-provisioner/pvc-*_monitoring_data-prometheus-*
# (only PVs that are no longer in use)
# 2. Delete Loki chunks (frees ~40G)
# Delete chunk files matched by -mtime +1 (modified at least two full days ago)
sudo find /opt/local-path-provisioner/pvc-*_monitoring_storage-loki-0/chunks -type f -mtime +1 -delete
# Remove everything left under chunks/ (this alone supersedes the find above)
sudo rm -rf /opt/local-path-provisioner/pvc-*_monitoring_storage-loki-0/chunks/*
# 3. Clean up system logs
sudo journalctl --vacuum-size=200M
# 4. Delete leftover Evicted pod records (Evicted pods are in phase Failed)
kubectl delete pods -A --field-selector=status.phase=Failed
kubectl delete pods -A --field-selector=status.phase=Succeeded
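Because these commands are destructive, it can be worth listing what a `find ... -delete` would remove before running it. The snippet below is a minimal sketch of such a dry run; `list_old_files` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# list_old_files (hypothetical helper): print regular files under <dir> that
# `find -mtime +<days>` matches, without deleting anything.
# Usage: list_old_files <dir> <days>
list_old_files() {
  local dir="$1" days="$2"
  find "$dir" -type f -mtime +"$days" -print
}

# Usage: review the count first, then rerun the find with -delete.
#   list_old_files /opt/local-path-provisioner/pvc-*_monitoring_storage-loki-0/chunks 1 | wc -l
```

The usage line assumes the `pvc-*` glob matches exactly one directory, as in the commands above.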
Helm values changes
1. Add Loki retention (dev/staging)
# dev/values/monitoring/values-loki.yaml
loki:
  compactor:
    retention_enabled: true
    delete_request_store: filesystem
  limits_config:
    retention_period: 72h # 3 days
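Since retention_period only takes effect when compactor retention is enabled as well, a quick pre-merge check on the values file can catch a half-applied change. This is a minimal sketch that uses plain text matching rather than a real YAML parse; `has_loki_retention` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# has_loki_retention (hypothetical helper): succeed only if the values file
# both enables compactor retention and sets a retention period.
# Text matching only; it does not verify where the keys sit in the YAML tree.
has_loki_retention() {
  local file="$1"
  grep -Eq '^[[:space:]]*retention_enabled:[[:space:]]*true' "$file" &&
    grep -Eq '^[[:space:]]*retention_period:[[:space:]]*[^[:space:]#]+' "$file"
}

# Usage:
#   has_loki_retention dev/values/monitoring/values-loki.yaml && echo "retention configured"
```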
2. Add revisionHistoryLimit (common charts)
# common-charts/apps/java-service/templates/deployment.yaml
# common-charts/apps/ai-service/templates/deployment.yaml
spec:
  revisionHistoryLimit: 3
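revisionHistoryLimit caps how many old ReplicaSets (scaled down to zero) a Deployment keeps for rollback. To see how many are currently retained, something like the following could be used. It is a minimal sketch: `count_idle_replicasets` is a hypothetical helper name, and it assumes the default columns of `kubectl get rs -A --no-headers` (NAMESPACE, NAME, DESIRED, ...).

```shell
#!/usr/bin/env bash
# count_idle_replicasets (hypothetical helper): read
# `kubectl get rs -A --no-headers` output on stdin and print how many
# ReplicaSets have DESIRED == 0 (3rd column in the default -A output).
count_idle_replicasets() {
  awk '$3 == 0 { n++ } END { print n + 0 }'
}

# Usage (requires cluster access):
#   kubectl get rs -A --no-headers | count_idle_replicasets
```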
3. staging ai-defense imagePullSecrets
# staging/values/apps/values-ai-defense.yaml
imagePullSecrets: [] # disabled because staging uses IRSA
4. dev ai-defense environment variable
# dev/values/apps/values-ai-defense.yaml
env:
  - name: TM_OFFLINE_LLM_AUDIT_PATH
    value: "/tmp/logs/offline_llm_audit.jsonl"
5. dev seat Kafka settings
# dev/values/apps/values-seat.yaml
env:
  - name: SPRING_KAFKA_BOOTSTRAP_SERVERS
    value: "kafka.messaging.svc.cluster.local:9092"
Commit message
feat(sprint5/troubleshooting): resolve disk pressure and CrashLoopBackOff issues
- common-charts: add revisionHistoryLimit: 3
- dev/staging loki: add compactor for retention
- staging ai-defense: add imagePullSecrets: []
- dev ai-defense: add TM_OFFLINE_LLM_AUDIT_PATH
- dev seat: add Kafka bootstrap servers
Lessons learned
- Loki retention is a must: without the compactor, retention_period has no effect
- local-path-provisioner orphan PVs: deleting a PVC does not automatically delete its directory
- staging uses IRSA: imagePullSecrets: [] must be set explicitly (to override the default)
- revisionHistoryLimit: the default of 10 is too many, so it is limited to 3