## Problem

- 1,253 `argocd-repo-server` pods created on the `mini-gmk` node (all in `Evicted` state)
- `DiskPressure: True`
- Multiple services in `CrashLoopBackOff` (ai-defense, seat, etc.)
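These symptoms can be confirmed straight from the API; a triage sketch, assuming the repo-server pods live in the `argocd` namespace (adjust names for your cluster):

```shell
# Count Failed (Evicted) argocd-repo-server pods
kubectl get pods -n argocd --field-selector=status.phase=Failed --no-headers | wc -l

# Confirm the DiskPressure condition on the affected node
kubectl describe node mini-gmk | grep -A1 DiskPressure

# List crash-looping pods cluster-wide
kubectl get pods -A | grep CrashLoopBackOff
```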
## Root Cause Analysis

### 1. DiskPressure (mini-gmk)

| Path | Usage | Cause |
|---|---|---|
| /opt/local-path-provisioner | 53G | Loki chunks (40G) + orphaned PVs |
| Loki retention | not configured | Logs accumulated indefinitely |
### 2. CrashLoopBackOff

| Service | Cause |
|---|---|
| staging ai-defense | `imagePullSecrets: []` missing, so the default `ecr-pull-secret` was used |
| dev ai-defense | Missing `TM_OFFLINE_LLM_AUDIT_PATH` env var (`PermissionError: logs/`) |
| dev seat | Missing `SPRING_KAFKA_BOOTSTRAP_SERVERS` env var |
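Each cause above can be verified from pod events and previous-container logs; a sketch of the commands (pod names are placeholders):

```shell
# Events show why the container failed to start or pull its image
kubectl describe pod <ai-defense-pod> -n staging | tail -n 20

# Logs of the last crashed container surface the PermissionError / missing env var
kubectl logs <ai-defense-pod> -n dev --previous

# Inspect the env actually rendered into the pod spec
kubectl get pod <seat-pod> -n dev -o jsonpath='{.spec.containers[0].env}'
```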
## Fix

### Resolving DiskPressure

```bash
# 1. Delete orphaned PVs (~8G freed; these PVs are not currently in use)
sudo rm -rf /opt/local-path-provisioner/pvc-*_data_cloudbeaver-data
sudo rm -rf /opt/local-path-provisioner/pvc-*_monitoring_data-prometheus-*

# 2. Delete Loki chunks (~40G freed)
sudo find /opt/local-path-provisioner/pvc-*_monitoring_storage-loki-0/chunks -type f -mtime +1 -delete
sudo rm -rf /opt/local-path-provisioner/pvc-*_monitoring_storage-loki-0/chunks/*

# 3. Clean up system logs
sudo journalctl --vacuum-size=200M

# 4. Delete Evicted pod records
kubectl delete pods -A --field-selector=status.phase=Failed
kubectl delete pods -A --field-selector=status.phase=Succeeded
```
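The `-mtime +1` filter in step 2 only matches files last modified more than 24 hours ago; a quick sanity check in a scratch directory (GNU `touch -d` assumed) before pointing it at Loki's chunk directory:

```shell
tmp=$(mktemp -d)
touch -d '2 days ago' "$tmp/old.chunk"   # stand-in for a stale Loki chunk
touch "$tmp/new.chunk"                   # freshly written chunk
find "$tmp" -type f -mtime +1 -delete    # same filter as the cleanup command
ls "$tmp"                                # only new.chunk should remain
```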
## Helm Values Updates

### 1. Add Loki retention (dev/staging)

```yaml
# dev/values/monitoring/values-loki.yaml
loki:
  compactor:
    retention_enabled: true
    delete_request_store: filesystem
  limits_config:
    retention_period: 72h  # 3 days
```
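Once these values are applied, the compactor should start logging retention activity; one way to confirm, assuming the Loki StatefulSet runs in the `monitoring` namespace under the name `loki`:

```shell
# Retention-related log lines indicate the compactor is actually deleting chunks
kubectl -n monitoring logs statefulset/loki | grep -i retention
```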
### 2. Add revisionHistoryLimit (common charts)

```yaml
# common-charts/apps/java-service/templates/deployment.yaml
# common-charts/apps/ai-service/templates/deployment.yaml
spec:
  revisionHistoryLimit: 3
```
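The effect of `revisionHistoryLimit` shows up as pruned ReplicaSets: after a few rollouts each Deployment should keep at most 3 old ones. A quick check, assuming the `dev` namespace:

```shell
# Old ReplicaSets beyond the limit are garbage-collected automatically
kubectl get rs -n dev | grep ai-defense
```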
### 3. staging ai-defense imagePullSecrets

```yaml
# staging/values/apps/values-ai-defense.yaml
imagePullSecrets: []  # Disabled since IRSA is used
```
### 4. dev ai-defense environment variable

```yaml
# dev/values/apps/values-ai-defense.yaml
env:
  - name: TM_OFFLINE_LLM_AUDIT_PATH
    value: "/tmp/logs/offline_llm_audit.jsonl"
```
### 5. dev seat Kafka config

```yaml
# dev/values/apps/values-seat.yaml
env:
  - name: SPRING_KAFKA_BOOTSTRAP_SERVERS
    value: "kafka.messaging.svc.cluster.local:9092"
```
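After the rollout, the new env vars can be verified inside the running containers (assuming the Deployments are named `ai-defense` and `seat`, matching the values files):

```shell
kubectl exec -n dev deploy/ai-defense -- printenv TM_OFFLINE_LLM_AUDIT_PATH
kubectl exec -n dev deploy/seat -- printenv SPRING_KAFKA_BOOTSTRAP_SERVERS
```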
## Commit

```text
feat(sprint5/troubleshooting): resolve disk pressure and CrashLoopBackOff issues

- common-charts: add revisionHistoryLimit: 3
- dev/staging loki: add compactor for retention
- staging ai-defense: add imagePullSecrets: []
- dev ai-defense: add TM_OFFLINE_LLM_AUDIT_PATH
- dev seat: add Kafka bootstrap servers
```
## Lessons Learned

- Loki retention is mandatory: `retention_period` alone does nothing; the compactor must be enabled for chunks to actually be deleted
- local-path-provisioner leaves orphaned PVs: deleting a PVC does not automatically remove the backing directory on disk
- Staging uses IRSA: `imagePullSecrets: []` must be set explicitly to override the default `ecr-pull-secret`
- revisionHistoryLimit: the Kubernetes default of 10 keeps too many old ReplicaSets; cap it at 3