Problem

  • 1,253 argocd-repo-server pods created on the mini-gmk node (all in Evicted state)
  • DiskPressure: True
  • Multiple services in CrashLoopBackOff (ai-defense, seat, etc.)
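A quick way to confirm symptoms like these is to tally pods by status. The sketch below is not from the original incident notes. The function name `tally_pod_status` is hypothetical, and it assumes the default column layout of `kubectl get pods -A --no-headers`.

```shell
#!/bin/sh
# Tally pods by STATUS from `kubectl get pods -A --no-headers` output on stdin.
# Assumes the default columns: NAMESPACE NAME READY STATUS RESTARTS AGE.
tally_pod_status() {
  # Field 4 is STATUS; print "<count> <status>", most frequent first.
  awk '{count[$4]++} END {for (s in count) print count[s], s}' | sort -rn
}

# Example against a live cluster (requires kubectl access), limited to one node:
#   kubectl get pods -A --no-headers --field-selector spec.nodeName=mini-gmk | tally_pod_status
```

A large Evicted count on a single node, combined with DiskPressure: True, points at node-level disk usage rather than at the evicted workload itself.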

Root Cause Analysis

1. DiskPressure (mini-gmk)

Path                         Usage           Cause
/opt/local-path-provisioner  53G             Loki chunks 40G + orphan PVs
Loki retention               not configured  Logs accumulated indefinitely
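Figures like the ones in this table can be gathered by ranking the entries under the provisioner directory by size. This is a minimal sketch, and the helper name `largest_entries` is hypothetical. It reports sizes in KiB and needs to run as root if the directories are not readable by the current user.

```shell
#!/bin/sh
# Print the N largest entries directly under DIR, largest first (sizes in KiB).
largest_entries() {
  dir=$1
  n=${2:-10}   # default to the top 10 entries
  du -sk "$dir"/* 2>/dev/null | sort -rn | head -n "$n"
}

# Example (run as root on the affected node):
#   largest_entries /opt/local-path-provisioner 5
```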

2. CrashLoopBackOff

Service             Cause
staging ai-defense  Missing imagePullSecrets: [] (default ecr-pull-secret being used)
dev ai-defense      Missing TM_OFFLINE_LLM_AUDIT_PATH env var (PermissionError: logs/)
dev seat            Missing SPRING_KAFKA_BOOTSTRAP_SERVERS env var
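Two of the three failures came from an environment variable missing in a values file, which a pre-deploy check could catch. The sketch below is illustrative only. `missing_env_vars` is a hypothetical helper, and it does a plain text match on `- name: <VAR>` lines rather than parsing YAML, so it assumes the `env:` list layout used in the values files shown in this document.

```shell
#!/bin/sh
# Print each required variable name that has no "- name: <VAR>" line in FILE.
# Text-level check only; it does not parse YAML or verify the value.
missing_env_vars() {
  file=$1
  shift
  for var in "$@"; do
    grep -Eq "^[[:space:]]*-[[:space:]]*name:[[:space:]]*\"?$var\"?[[:space:]]*$" "$file" \
      || printf '%s\n' "$var"
  done
}

# Example:
#   missing_env_vars dev/values/apps/values-seat.yaml SPRING_KAFKA_BOOTSTRAP_SERVERS
```

Empty output means every listed variable has an entry.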

Fix

Resolving DiskPressure

# 1. Delete orphan PV directories (~8G freed)
#    (these directories belong to PVs that are not currently in use)
sudo rm -rf /opt/local-path-provisioner/pvc-*_data_cloudbeaver-data
sudo rm -rf /opt/local-path-provisioner/pvc-*_monitoring_data-prometheus-*

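# 2. Delete Loki chunks (~40G freed)
#    First pass: delete only chunk files matched by -mtime +1
#    (last modified at least two days ago)
sudo find /opt/local-path-provisioner/pvc-*_monitoring_storage-loki-0/chunks -type f -mtime +1 -delete
#    Second pass: delete every remaining chunk. This removes all stored log
#    data and makes the first pass redundant if both are run.
sudo rm -rf /opt/local-path-provisioner/pvc-*_monitoring_storage-loki-0/chunks/*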

# 3. Clean up system logs
sudo journalctl --vacuum-size=200M

# 4. Delete Evicted pod records
kubectl delete pods -A --field-selector=status.phase=Failed
kubectl delete pods -A --field-selector=status.phase=Succeeded
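Before deleting directories by hand, it helps to list which ones are actually orphaned. The sketch below is one possible approach, not the procedure used during the incident. `find_orphan_dirs` is a hypothetical helper, and it assumes the `<pv-name>_<namespace>_<pvc-name>` directory naming visible in the paths above.

```shell
#!/bin/sh
# List directories under BASE whose PV-name prefix is not in PV_LIST.
# PV_LIST is a file with one PV name per line, for example produced by:
#   kubectl get pv -o custom-columns=NAME:.metadata.name --no-headers > /tmp/pv-names.txt
find_orphan_dirs() {
  base=$1
  pv_list=$2
  for d in "$base"/pvc-*; do
    [ -d "$d" ] || continue
    name=$(basename "$d")
    pv=${name%%_*}   # text before the first underscore is the PV name
    grep -qxF "$pv" "$pv_list" || printf '%s\n' "$d"
  done
}

# Example (review the output before removing anything):
#   find_orphan_dirs /opt/local-path-provisioner /tmp/pv-names.txt
```

The function only prints candidates. Deletion stays a separate, deliberate step.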

Helm Values Updates

1. Add Loki retention (dev/staging)

# dev/values/monitoring/values-loki.yaml
loki:
  compactor:
    retention_enabled: true
    delete_request_store: filesystem
  limits_config:
    retention_period: 72h  # 3 days
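Because retention_period has no effect unless the compactor has retention enabled, a small lint step can guard against setting one without the other. This is a rough sketch using plain text matching, not a YAML parser. `check_loki_retention` is a hypothetical name.

```shell
#!/bin/sh
# Fail if FILE sets retention_period but does not set retention_enabled: true.
# Text-level check; it does not verify where in the YAML tree the keys sit.
check_loki_retention() {
  file=$1
  if grep -Eq '^[[:space:]]*retention_period:' "$file" &&
     ! grep -Eq '^[[:space:]]*retention_enabled:[[:space:]]*true' "$file"; then
    echo "retention_period is set but compactor retention_enabled is not true"
    return 1
  fi
  echo "ok"
}

# Example:
#   check_loki_retention dev/values/monitoring/values-loki.yaml
```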

2. Add revisionHistoryLimit (common charts)

# common-charts/apps/java-service/templates/deployment.yaml
# common-charts/apps/ai-service/templates/deployment.yaml
spec:
  revisionHistoryLimit: 3
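Old ReplicaSets that a Deployment keeps for rollback are scaled to zero, so counting them shows how much history is being retained before and after the change. A minimal sketch follows. `count_idle_replicasets` is a hypothetical helper and assumes the default columns of `kubectl get rs -A --no-headers`.

```shell
#!/bin/sh
# Count ReplicaSets scaled to zero from `kubectl get rs -A --no-headers` on stdin.
# Assumes the default columns: NAMESPACE NAME DESIRED CURRENT READY AGE.
count_idle_replicasets() {
  awk '$3 == 0 && $4 == 0 {n++} END {print n+0}'
}

# Example against a live cluster (requires kubectl access):
#   kubectl get rs -A --no-headers | count_idle_replicasets
```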

3. staging ai-defense imagePullSecrets

# staging/values/apps/values-ai-defense.yaml
imagePullSecrets: []  # Disabled since IRSA is used

4. dev ai-defense environment variable

# dev/values/apps/values-ai-defense.yaml
env:
  - name: TM_OFFLINE_LLM_AUDIT_PATH
    value: "/tmp/logs/offline_llm_audit.jsonl"

5. dev seat Kafka config

# dev/values/apps/values-seat.yaml
env:
  - name: SPRING_KAFKA_BOOTSTRAP_SERVERS
    value: "kafka.messaging.svc.cluster.local:9092"

Commit

feat(sprint5/troubleshooting): resolve disk pressure and CrashLoopBackOff issues

- common-charts: add revisionHistoryLimit: 3
- dev/staging loki: add compactor for retention
- staging ai-defense: add imagePullSecrets: []
- dev ai-defense: add TM_OFFLINE_LLM_AUDIT_PATH
- dev seat: add Kafka bootstrap servers

Lessons Learned

  1. Loki retention is mandatory — retention_period does nothing without the compactor
  2. local-path-provisioner orphan PVs — deleting a PVC does not automatically remove the directory on disk
  3. Staging uses IRSA, so imagePullSecrets: [] must be explicitly set to override the default
  4. revisionHistoryLimit — the default of 10 is too high; cap it at 3