OTel Metrics Dashboard Troubleshooting
Problem#
In the Grafana “Application Monitoring (Spring Boot)” dashboard:
- Only gateway appeared in the application dropdown
- Other services (auth-guard, order-core, seat, queue, etc.) were missing
- Querying http_server_requests_seconds_count{namespace="dev-webs"} returned results only for gateway
Root Cause Analysis#
1. Two separate metric collection paths#
| Path | Metric name | Labels | Targets |
|---|---|---|---|
| PodMonitor → Prometheus direct scrape | http_server_requests_seconds_* | status, uri, namespace | gateway only |
| OTel Agent → OTel Collector → Remote Write | http_server_request_duration_seconds_* | http_response_status_code, http_route | all services |
2. Why only gateway had Micrometer metrics#
- gateway: Spring Cloud Gateway, with the micrometer-registry-prometheus dependency included
- Other services: OTel Java Agent only, no Micrometer dependency
3. Missing namespace label in OTel metrics#
The OTel Collector’s transform processor was configured to set the namespace label:
transform:
  metric_statements:
    - context: datapoint
      statements:
        - set(attributes["namespace"], resource.attributes["service.namespace"])
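Problem: The OTel Java Agent was not setting the service.namespace resource attribute, so the statement had no value to copy and the OTel metrics reached Prometheus without a namespace label.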
Fix#
1. Add OTEL_RESOURCE_ATTRIBUTES to Deployments#
java-service/templates/deployment.yaml:
{{- if .Values.otel.enabled }}
- name: JAVA_TOOL_OPTIONS
  value: "-javaagent:/otel/opentelemetry-javaagent.jar"
- name: OTEL_SERVICE_NAME
  value: {{ include "java-service.fullname" . }}
- name: OTEL_RESOURCE_ATTRIBUTES # added
  value: "service.namespace={{ .Release.Namespace }}"
- name: OTEL_EXPORTER_OTLP_ENDPOINT
  value: {{ .Values.otel.collectorEndpoint | quote }}
{{- end }}
The same change was applied to ai-service/templates/deployment.yaml.
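To make the interaction between the new environment variable and the Collector's transform statement concrete, here is a minimal Python sketch that simulates both steps. It is not the agent's or the Collector's actual code: the function names are hypothetical, the parser only covers the plain comma-separated key=value form of OTEL_RESOURCE_ATTRIBUTES, and it assumes the set statement does nothing when the source attribute is absent, which is consistent with the missing label described above.

```python
def parse_resource_attributes(raw):
    """Parse an OTEL_RESOURCE_ATTRIBUTES-style string ("k1=v1,k2=v2") into a dict."""
    attrs = {}
    for pair in raw.split(","):
        if not pair.strip():
            continue  # tolerate empty segments such as a trailing comma
        key, _, value = pair.partition("=")
        attrs[key.strip()] = value.strip()
    return attrs


def apply_namespace_transform(resource_attrs, datapoint_attrs):
    """Simulate: set(attributes["namespace"], resource.attributes["service.namespace"]).

    Assumption: when service.namespace is absent, the datapoint is left unchanged.
    """
    result = dict(datapoint_attrs)
    namespace = resource_attrs.get("service.namespace")
    if namespace is not None:
        result["namespace"] = namespace
    return result


# Before the fix: the agent sends no service.namespace, so no namespace label is added.
before = apply_namespace_transform({}, {"http_route": "/orders"})

# After the fix: the Deployment sets OTEL_RESOURCE_ATTRIBUTES for the release namespace.
resource = parse_resource_attributes("service.namespace=dev-webs")
after = apply_namespace_transform(resource, {"http_route": "/orders"})

print(before)
print(after)
```

In this simulation the datapoint only gains a namespace label once the resource carries service.namespace, which is why a namespace="dev-webs" selector found nothing for the agent-only services before the change.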
2. Update dashboard to use OTel metrics#
Metric name changes:
| Micrometer (before) | OTel (after) |
|---|---|
| http_server_requests_seconds_count | http_server_request_duration_seconds_count |
| http_server_requests_seconds_bucket | http_server_request_duration_seconds_bucket |
| http_server_requests_seconds_sum | http_server_request_duration_seconds_sum |
Label name changes:
| Micrometer (before) | OTel (after) |
|---|---|
| status | http_response_status_code |
| uri | http_route |
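The two rename tables can be applied mechanically to existing panel expressions. The sketch below is one possible helper for that, not the procedure actually used for this migration: the function and constant names are hypothetical, and the regexes are deliberately naive (they only handle label matchers and by/without clauses, and could misfire on a label value that itself contains text such as status=), so every rewritten query should still be reviewed by hand.

```python
import re

NEW_METRIC_PREFIX = "http_server_request_duration_seconds"
LABEL_RENAMES = {"status": "http_response_status_code", "uri": "http_route"}

# Micrometer metric name followed by one of the three histogram suffixes.
_METRIC_RE = re.compile(r"\bhttp_server_requests_seconds(_count|_bucket|_sum)\b")
# A label name directly followed by a matcher operator (=, !=, =~, !~).
_MATCHER_RE = re.compile(r"\b(status|uri)\b(?=\s*(?:=~|!~|!=|=))")
# Grouping clauses such as "by (uri, status)".
_GROUPING_RE = re.compile(r"\b(by|without)\s*\(([^)]*)\)")


def _rename_grouping(match):
    labels = [label.strip() for label in match.group(2).split(",")]
    renamed = [LABEL_RENAMES.get(label, label) for label in labels]
    return f"{match.group(1)} ({', '.join(renamed)})"


def migrate_expr(expr):
    """Rewrite a Micrometer-based PromQL expression to the OTel names."""
    expr = _METRIC_RE.sub(lambda m: NEW_METRIC_PREFIX + m.group(1), expr)
    expr = _MATCHER_RE.sub(lambda m: LABEL_RENAMES[m.group(1)], expr)
    return _GROUPING_RE.sub(_rename_grouping, expr)


old_expr = (
    'sum by (uri) (rate(http_server_requests_seconds_count'
    '{namespace="dev-webs", status=~"5.."}[5m]))'
)
print(migrate_expr(old_expr))
```

Running the helper over the expr fields of a dashboard JSON export would cover the bulk of the renames; anything it leaves untouched, for example the app label, is outside the scope of the tables above.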
3. Update template variable query#
# Before
label_values(http_server_requests_seconds_count{namespace="$namespace"}, app)
# After
label_values(http_server_request_duration_seconds_count{namespace="$namespace"}, app)
4. Fix filter regex (Top 10 slow endpoints)#
Updated to also match patterns that appear in the middle of a context-path:
# Before: paths like /auth/actuator/health/** were not being filtered
http_route!~"/actuator.*|/swagger.*|..."
# After: match patterns anywhere in the route
http_route!~".*actuator.*|.*swagger.*|.*health.*|..."
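The old filter failed because Prometheus anchors regex matchers at both ends: the pattern has to match the entire label value, so /actuator.* never matches a route that starts with a context path. The short Python sketch below imitates that behaviour with re.fullmatch. Only the alternatives spelled out above are included (the elided ones are left out), the helper name is hypothetical, and Python's re module stands in for Prometheus's RE2 engine, which behaves the same for these simple patterns.

```python
import re

OLD_FILTER = r"/actuator.*|/swagger.*"
NEW_FILTER = r".*actuator.*|.*swagger.*|.*health.*"


def is_excluded(pattern, route):
    """True when http_route!~pattern would drop the series.

    Prometheus regex matchers are fully anchored, so fullmatch is used here.
    """
    return re.fullmatch(pattern, route) is not None


for route in ["/actuator/health", "/auth/actuator/health", "/orders/{id}"]:
    print(route, is_excluded(OLD_FILTER, route), is_excluded(NEW_FILTER, route))
```

One trade-off of the broader pattern: .*health.* also excludes any business route that merely contains the substring health, so it is worth checking the route list for such names before relying on it.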
Modified Files#
- common-charts/apps/java-service/templates/deployment.yaml
  - Added OTEL_RESOURCE_ATTRIBUTES env var
- common-charts/apps/ai-service/templates/deployment.yaml
  - Added OTEL_RESOURCE_ATTRIBUTES env var
- common-charts/infra/monitoring/prometheus-stack/files/grafana-dashboards/dev-team-observability/application-monitoring-springboot.json
  - Migrated to OTel metrics
- common-charts/infra/monitoring/prometheus-stack/files/grafana-dashboards/infra-team-observability/k6/k6-overview.json
  - Migrated to OTel metrics
  - Removed RDS/Redis template variables
  - Fixed PostgreSQL query label (datname → removed)
Metric Flow#
┌─────────────────┐
│ Java Service │
│ (OTel Agent) │
└────────┬────────┘
│ OTLP (gRPC :4317)
│ service.namespace=dev-webs
▼
┌─────────────────┐
│ OTel Collector │
│ (transform) │
│ │
│ namespace ← │
│ service.namespace
└────────┬────────┘
│ Remote Write
▼
┌─────────────────┐
│ Prometheus │
│ │
│ http_server_ │
│ request_duration│
│ _seconds_count │
│ {namespace= │
│ "dev-webs"} │
└────────┬────────┘
│
▼
┌─────────────────┐
│ Grafana │
│ Dashboard │
└─────────────────┘
Deployment Order#
1. Deploy java-service and ai-service changes
   - ArgoCD Sync for dev-webs, dev-ai apps
   - Pod restarts to apply OTEL_RESOURCE_ATTRIBUTES
2. Deploy prometheus-stack
   - Dashboard ConfigMap updated
3. Verify
   - Query http_server_request_duration_seconds_count{namespace="dev-webs"} in Prometheus
   - All services (gateway, auth-guard, order-core, seat, queue) should appear
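The verification step can also be scripted against the JSON that Prometheus's /api/v1/query endpoint returns for the query above. The following is a minimal sketch under a few assumptions: the service is identified by the app label (as in the dashboard's template variable query), the helper names are hypothetical, and the sample payload is made-up data that only mirrors the shape of an instant-query response.

```python
EXPECTED_SERVICES = {"gateway", "auth-guard", "order-core", "seat", "queue"}


def services_in_response(payload, label="app"):
    """Collect the distinct values of `label` from an instant-query response."""
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    return {
        series["metric"][label]
        for series in payload["data"]["result"]
        if label in series["metric"]
    }


def missing_services(payload, expected=EXPECTED_SERVICES, label="app"):
    """Return the expected services that have no series in the response."""
    return sorted(set(expected) - services_in_response(payload, label))


# Made-up sample shaped like a Prometheus instant-query response.
sample = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"app": "gateway", "namespace": "dev-webs"}, "value": [0, "12"]},
            {"metric": {"app": "seat", "namespace": "dev-webs"}, "value": [0, "3"]},
        ],
    },
}
print(missing_services(sample))
```

An empty list means every expected service is reporting; any names that remain point at pods that have not yet restarted with OTEL_RESOURCE_ATTRIBUTES or have not served a request since the restart.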
OTel vs Micrometer: Why OTel?#
| Item | OTel | Micrometer |
|---|---|---|
| Standard | CNCF standard, vendor-neutral | Spring ecosystem |
| Dependency | OTel Agent only (runtime) | Requires micrometer-registry-prometheus |
| Configuration | Environment variables | application.yml |
| Metric names | OpenTelemetry Semantic Conventions | Spring Boot naming |
Conclusion: Since all services already have the OTel Agent attached, standardize on OTel metrics across the board.