## Problem

Grafana monitoring dashboards were broken because the underlying infrastructure differs between the Dev (kubeadm) and Staging (EKS) environments.

### Infrastructure differences by environment

| Component | Dev (kubeadm) | Staging (EKS) |
|---|---|---|
| PostgreSQL | Local Pod (`postgres:16-alpine`) | AWS RDS |
| Redis | Local Pod (`redis:7-alpine`) | AWS ElastiCache |
| Metrics collection | Prometheus exporter | CloudWatch |
| Data source | Prometheus | CloudWatch |

## Symptoms

- “RDS PostgreSQL Monitoring” dashboard in Dev → No data
- “ElastiCache Redis Monitoring” dashboard in Dev → No data
- Cause: no CloudWatch data source (not an AWS environment)
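
For reference, the CloudWatch data source only exists where Grafana is provisioned with one. A minimal sketch of that provisioning on the Staging side, assuming kube-prometheus-stack's `grafana.additionalDataSources` mechanism (the region and auth mode are placeholders, not values from the actual setup):

```yaml
# Sketch: CloudWatch data source provisioned in Staging only
# (authType/defaultRegion are placeholder assumptions)
grafana:
  additionalDataSources:
    - name: CloudWatch
      type: cloudwatch
      jsonData:
        authType: default          # e.g. node role / IRSA
        defaultRegion: ap-northeast-2
```

Dev has no such entry, so any panel bound to the CloudWatch data source renders “No data”.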

## Fix

### 1. Separate dashboards by environment

Created an environment-specific version of each dashboard:

| Dashboard | Data source | Environment |
|---|---|---|
| `database-rds-postgresql.json` | CloudWatch | Staging |
| `database-postgresql-local.json` | Prometheus | Dev |
| `cache-elasticache-redis.json` | CloudWatch | Staging |
| `cache-redis-local.json` | Prometheus | Dev |

### 2. Per-environment exclude configuration

In the Helm values, each environment excludes the dashboards that don't apply to it:

Dev environment (`dev/values/monitoring/values-prometheus-stack.yaml`):

```yaml
dashboards:
  devTeam:
    exclude:
      # Exclude CloudWatch-based dashboards (Dev uses local exporters)
      - "database-rds-postgresql.json"
      - "cache-elasticache-redis.json"
```

Staging environment (`staging/values/values-prometheus-stack.yaml`):

```yaml
dashboards:
  devTeam:
    exclude:
      # Exclude local exporter dashboards (EKS uses CloudWatch)
      - "database-postgresql-local.json"
      - "cache-redis-local.json"
```

### 3. Update ConfigMap template

Modified the dashboard ConfigMap generation to respect the exclude list:

```yaml
{{- /* All dashboard JSON files bundled with the chart */ -}}
{{- $dashboardFiles := .Files.Glob "files/grafana-dashboards/dev-team-observability/*.json" -}}
{{- /* Per-environment exclude list; defaults to empty */ -}}
{{- $excludeList := .Values.dashboards.devTeam.exclude | default list -}}
{{- if gt (len $dashboardFiles) 0 }}
...
{{- range $path, $_ := $dashboardFiles }}
{{- $filename := base $path -}}
{{- /* Skip any dashboard listed in this environment's exclude list */ -}}
{{- if not (has $filename $excludeList) }}
  {{ $filename }}: |-
{{ $.Files.Get $path | nindent 4 }}
{{- end }}
{{- end }}
{{- end }}
```
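
The template falls back to an empty list when nothing is configured, so the chart ships a matching default in its `values.yaml` (see File Changes below):

```yaml
# Chart default: no dashboards excluded
dashboards:
  devTeam:
    exclude: []
```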

## Metric Comparison

### PostgreSQL

| Metric | Dev (postgres_exporter) | Staging (CloudWatch) |
|---|---|---|
| CPU | `process_cpu_seconds_total` | `CPUUtilization` |
| Connections | `pg_stat_activity_count` | `DatabaseConnections` |
| Cache hit | `pg_stat_database_blks_hit` | N/A (RDS-managed) |
| DB size | `pg_database_size_bytes` | `FreeStorageSpace` (inverse) |
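
In Dev, the exporter-side metrics above come from postgres_exporter (image version in References) scraping the local Pod. A minimal sketch of wiring it as a sidecar; the container layout and DSN here are assumptions, not the actual manifest:

```yaml
# Sketch: postgres_exporter sidecar next to the local PostgreSQL container.
# The DSN is a placeholder; use a Secret in real manifests.
containers:
  - name: postgres
    image: postgres:16-alpine
  - name: postgres-exporter
    image: prometheuscommunity/postgres-exporter:v0.15.0
    env:
      - name: DATA_SOURCE_NAME   # the exporter's connection-string env var
        value: "postgresql://postgres:<password>@localhost:5432/postgres?sslmode=disable"
    ports:
      - name: metrics
        containerPort: 9187      # postgres_exporter default port
```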

### Redis

| Metric | Dev (redis_exporter) | Staging (CloudWatch) |
|---|---|---|
| CPU | `redis_cpu_sys_seconds_total` | `CPUUtilization`, `EngineCPUUtilization` |
| Memory | `redis_memory_used_bytes` | `BytesUsedForCache` |
| Connections | `redis_connected_clients` | `CurrConnections` |
| Cache hit | `redis_keyspace_hits_total` | `CacheHits` |
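
Likewise for Redis in Dev: redis_exporter (image in References) exposes these metrics. A sketch under the same sidecar assumption:

```yaml
# Sketch: redis_exporter sidecar next to the local Redis container
containers:
  - name: redis
    image: redis:7-alpine
  - name: redis-exporter
    image: oliver006/redis_exporter:v1.66.0
    args: ["--redis.addr=redis://localhost:6379"]
    ports:
      - name: metrics
        containerPort: 9121   # redis_exporter default port
```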

## File Changes

### Added files

- `common-charts/.../dev-team-observability/database-postgresql-local.json`
- `common-charts/.../dev-team-observability/cache-redis-local.json`

### Modified files

- `common-charts/.../templates/grafana-dashboards-dev-team-observability-configmap.yaml`
  - Added the exclude logic
- `common-charts/.../values.yaml`
  - Added the `dashboards.devTeam.exclude` default value
- `dev/values/monitoring/values-prometheus-stack.yaml`
  - Excludes the CloudWatch dashboards
- `staging/values/values-prometheus-stack.yaml`
  - Excludes the local exporter dashboards

## Symptoms

`istio_requests_total` metrics were not being scraped by Prometheus in Staging.

## Cause

Prometheus was configured to select only PodMonitors/ServiceMonitors carrying the label `release: kube-prometheus-stack`, but the Istio PodMonitor had `release: prometheus`.
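
In other words, the operator renders label-filtered selectors onto the Staging Prometheus custom resource, roughly:

```yaml
# Effective selectors on the Staging Prometheus CR (sketch):
# monitors without this label are silently ignored
serviceMonitorSelector:
  matchLabels:
    release: kube-prometheus-stack
podMonitorSelector:
  matchLabels:
    release: kube-prometheus-stack
```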

## Fix

Unified all monitor labels to `release: kube-prometheus-stack`:

```yaml
# Before
labels:
  release: prometheus

# After
labels:
  release: kube-prometheus-stack
```
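
For context, a trimmed sketch of the corrected Envoy PodMonitor; the metadata matches the fix above, while the spec fields follow Istio's standard sample and are assumptions here:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: envoy-stats
  labels:
    release: kube-prometheus-stack   # must match Prometheus's podMonitorSelector
spec:
  namespaceSelector:
    any: true                        # Envoy sidecars live in app namespaces
  selector:
    matchExpressions:
      - key: istio-prometheus-ignore
        operator: DoesNotExist
  podMetricsEndpoints:
    - path: /stats/prometheus        # merged Envoy/app metrics endpoint
      interval: 15s
```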

Modified files:

- `common-charts/infra/istio/istiod/templates/podmonitor-envoy.yaml`
- `common-charts/infra/istio/istiod/templates/servicemonitor.yaml`
- `common-charts/infra/argocd/templates/servicemonitor.yaml`

### Dev vs Staging Prometheus configuration difference

| Setting | Dev | Staging |
|---|---|---|
| `serviceMonitorSelectorNilUsesHelmValues` | `false` (collect all monitors) | default (label-filtered) |
| `podMonitorSelectorNilUsesHelmValues` | `false` (collect all monitors) | default (label-filtered) |

Dev collects all monitors regardless of labels. Staging filters by label, so labels must match.
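
In kube-prometheus-stack values terms, the Dev row of that table corresponds to:

```yaml
# Dev only: opt out of label filtering so Prometheus picks up
# every ServiceMonitor/PodMonitor in the cluster
prometheus:
  prometheusSpec:
    serviceMonitorSelectorNilUsesHelmValues: false
    podMonitorSelectorNilUsesHelmValues: false
```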


## Verification

```bash
# 1. Check Prometheus targets
kubectl exec -n monitoring sts/prometheus-prom -c prometheus -- \
  wget -qO- 'http://localhost:9090/api/v1/targets' | grep envoy

# 2. Query Istio metrics
kubectl exec -n monitoring sts/prometheus-prom -c prometheus -- \
  wget -qO- 'http://localhost:9090/api/v1/query?query=istio_requests_total'

# 3. Check dashboard ConfigMap (\S+ also matches hyphenated filenames)
kubectl get configmap grafana-dashboards-dev-team-observability -n monitoring -o yaml | grep -E "^\s+\S+\.json"
```

## References

- Prometheus postgres_exporter: `prometheuscommunity/postgres-exporter:v0.15.0`
- Prometheus redis_exporter: `oliver006/redis_exporter:v1.66.0`
- CloudWatch metric namespaces:
  - RDS: `AWS/RDS`
  - ElastiCache: `AWS/ElastiCache`