Problem

In the Grafana “Application Monitoring (Spring Boot)” dashboard:

  • Only gateway appeared in the application dropdown
  • Other services (auth-guard, order-core, seat, queue, etc.) were missing
  • Querying http_server_requests_seconds_count{namespace="dev-webs"} only returned results for gateway

Root Cause Analysis

1. Two separate metric collection paths

| Path | Metric name | Labels | Targets |
|---|---|---|---|
| PodMonitor → Prometheus direct scrape | http_server_requests_seconds_* | status, uri, namespace | gateway only |
| OTel Agent → OTel Collector → Remote Write | http_server_request_duration_seconds_* | http_response_status_code, http_route | all services |

2. Why only gateway had Micrometer metrics

  • gateway: Spring Cloud Gateway with micrometer-registry-prometheus dependency included
  • Other services: OTel Java Agent only, no Micrometer dependency

3. Missing namespace label in OTel metrics

The OTel Collector’s transform processor was configured to set the namespace label:

transform:
  metric_statements:
    - context: datapoint
      statements:
        - set(attributes["namespace"], resource.attributes["service.namespace"])

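Problem: The OTel Java Agent was not setting the service.namespace resource attribute, so the transform had no value to copy and the OTel metrics reached Prometheus without a namespace label. Any query filtering on namespace="dev-webs" therefore matched only the gateway series scraped through the PodMonitor.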
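
The effect of the missing attribute can be modeled in a few lines. The sketch below is a plain-Python imitation of the transform statement above, not Collector code. The function name apply_namespace_statement and the sample attribute dictionaries are invented for illustration, and it assumes that OTTL's set function takes no action when the source value resolves to nil.

```python
def apply_namespace_statement(resource_attrs, datapoint_attrs):
    """Imitate: set(attributes["namespace"], resource.attributes["service.namespace"])."""
    value = resource_attrs.get("service.namespace")
    result = dict(datapoint_attrs)  # copy so the caller's dict is not mutated
    if value is not None:
        # Only write the label when the resource attribute actually exists.
        result["namespace"] = value
    return result


# Before the fix: the agent sends no service.namespace (illustrative data).
before_fix = apply_namespace_statement(
    {"service.name": "order-core"},
    {"http_route": "/orders"},
)

# After the fix: OTEL_RESOURCE_ATTRIBUTES adds service.namespace.
after_fix = apply_namespace_statement(
    {"service.name": "order-core", "service.namespace": "dev-webs"},
    {"http_route": "/orders"},
)

print(before_fix)
print(after_fix)
```

In this model only the second data point carries a namespace label, which mirrors why the dashboard's namespace filter hid every OTel-only service.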

Fix

1. Add OTEL_RESOURCE_ATTRIBUTES to Deployments

java-service/templates/deployment.yaml:

{{- if .Values.otel.enabled }}
- name: JAVA_TOOL_OPTIONS
  value: "-javaagent:/otel/opentelemetry-javaagent.jar"
- name: OTEL_SERVICE_NAME
  value: {{ include "java-service.fullname" . }}
- name: OTEL_RESOURCE_ATTRIBUTES          # added
  value: "service.namespace={{ .Release.Namespace }}"
- name: OTEL_EXPORTER_OTLP_ENDPOINT
  value: {{ .Values.otel.collectorEndpoint | quote }}
{{- end }}

The same change was applied to ai-service/templates/deployment.yaml.
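
OTEL_RESOURCE_ATTRIBUTES carries comma-separated key=value pairs. As a rough illustration of what the agent sees once the template is rendered for a release in dev-webs, here is a simplified parser. The function name parse_resource_attributes is hypothetical, and the sketch skips details such as percent-decoding that a real SDK handles.

```python
def parse_resource_attributes(raw):
    """Split a 'k1=v1,k2=v2' string into a dict, skipping empty entries."""
    attrs = {}
    for pair in raw.split(","):
        pair = pair.strip()
        if not pair:
            continue
        # partition keeps any later '=' characters inside the value
        key, _, value = pair.partition("=")
        attrs[key.strip()] = value.strip()
    return attrs


# Value the Helm template would render when .Release.Namespace is dev-webs.
rendered = "service.namespace=dev-webs"
print(parse_resource_attributes(rendered))
```

Because the variable is a list, further resource attributes can be appended later with a comma without touching the Collector configuration.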

2. Update dashboard to use OTel metrics

Metric name changes:

| Micrometer (before) | OTel (after) |
|---|---|
| http_server_requests_seconds_count | http_server_request_duration_seconds_count |
| http_server_requests_seconds_bucket | http_server_request_duration_seconds_bucket |
| http_server_requests_seconds_sum | http_server_request_duration_seconds_sum |

Label name changes:

| Micrometer (before) | OTel (after) |
|---|---|
| status | http_response_status_code |
| uri | http_route |
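
For dashboards with many panels, the two rename tables can be applied mechanically. The following is a naive sketch rather than the procedure actually used for this migration: migrate_expr is a made-up helper, it only renames labels that appear directly before a matcher operator (=, !=, =~, !~), and it does not handle by(...) clauses or label-like text inside quoted values.

```python
import re

METRIC_RENAMES = {
    "http_server_requests_seconds_count": "http_server_request_duration_seconds_count",
    "http_server_requests_seconds_bucket": "http_server_request_duration_seconds_bucket",
    "http_server_requests_seconds_sum": "http_server_request_duration_seconds_sum",
}
LABEL_RENAMES = {
    "status": "http_response_status_code",
    "uri": "http_route",
}


def migrate_expr(expr):
    """Rewrite Micrometer metric and label names in a PromQL string to OTel names."""
    for old, new in METRIC_RENAMES.items():
        expr = re.sub(rf"\b{re.escape(old)}\b", new, expr)
    for old, new in LABEL_RENAMES.items():
        # Lookahead: rename only when the word is followed by a matcher operator.
        expr = re.sub(rf"\b{re.escape(old)}(?=\s*(?:=~|!~|!=|=))", new, expr)
    return expr


old_query = 'sum(rate(http_server_requests_seconds_count{namespace="dev-webs", status=~"5.."}[5m]))'
print(migrate_expr(old_query))
```

Each rewritten panel query should still be reviewed by hand, since the two metric families are produced by different instrumentation and may not be identical in every detail.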

3. Update template variable query

# Before
label_values(http_server_requests_seconds_count{namespace="$namespace"}, app)

# After
label_values(http_server_request_duration_seconds_count{namespace="$namespace"}, app)

4. Fix filter regex (Top 10 slow endpoints)

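Prometheus regex matchers are fully anchored, so a pattern such as /actuator.* only matches routes that begin with /actuator. The filter was updated to also match patterns that appear in the middle of a context path: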

# Before: paths like /auth/actuator/health/** were not being filtered
http_route!~"/actuator.*|/swagger.*|..."

# After: match patterns anywhere in the route
http_route!~".*actuator.*|.*swagger.*|.*health.*|..."
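
The difference between the two filters comes down to whole-string matching, which Python's re.fullmatch reproduces for patterns this simple. The sketch below uses only the alternatives spelled out above (the elided "..." part is left out), and the route /orders/{id} is a made-up example of a business endpoint.

```python
import re

OLD_FILTER = r"/actuator.*|/swagger.*"
NEW_FILTER = r".*actuator.*|.*swagger.*|.*health.*"


def is_excluded(route, pattern):
    """Mimic a !~ matcher: the route is dropped when the whole string matches."""
    return re.fullmatch(pattern, route) is not None


routes = [
    "/actuator/health",          # no context path
    "/auth/actuator/health/**",  # context path in front, from the case above
    "/orders/{id}",              # hypothetical business route
]

for route in routes:
    print(route, is_excluded(route, OLD_FILTER), is_excluded(route, NEW_FILTER))
```

The new pattern is deliberately broad: it also drops any business route that happens to contain actuator, swagger, or health.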

Modified Files

  1. common-charts/apps/java-service/templates/deployment.yaml
    • Added OTEL_RESOURCE_ATTRIBUTES env var
  2. common-charts/apps/ai-service/templates/deployment.yaml
    • Added OTEL_RESOURCE_ATTRIBUTES env var
  3. common-charts/infra/monitoring/prometheus-stack/files/grafana-dashboards/dev-team-observability/application-monitoring-springboot.json
    • Migrated to OTel metrics
  4. common-charts/infra/monitoring/prometheus-stack/files/grafana-dashboards/infra-team-observability/k6/k6-overview.json
    • Migrated to OTel metrics
    • Removed RDS/Redis template variables
    • Fixed the PostgreSQL query label (removed datname)

Metric Flow

┌─────────────────┐
│  Java Service   │
│  (OTel Agent)   │
└────────┬────────┘
         │ OTLP (gRPC :4317)
         │ service.namespace=dev-webs
         ▼
┌───────────────────┐
│  OTel Collector   │
│    (transform)    │
│                   │
│ namespace ←       │
│ service.namespace │
└────────┬──────────┘
         │ Remote Write
         ▼
┌─────────────────┐
│   Prometheus    │
│                 │
│ http_server_    │
│ request_duration│
│ _seconds_count  │
│ {namespace=     │
│  "dev-webs"}    │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│    Grafana      │
│   Dashboard     │
└─────────────────┘

Deployment Order

  1. Deploy java-service and ai-service changes
    • ArgoCD Sync for dev-webs, dev-ai apps
    • Pod restarts to apply OTEL_RESOURCE_ATTRIBUTES
  2. Deploy prometheus-stack
    • Dashboard ConfigMap updated
  3. Verify
    • Query http_server_request_duration_seconds_count{namespace="dev-webs"} in Prometheus
    • All services (gateway, auth-guard, order-core, seat, queue) should appear
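
The verification step can also be scripted against the JSON that the Prometheus instant-query API returns. The following is a minimal sketch under two assumptions: that the app label identifies the service (as in the template variable query), and that the response has the standard status / data.result shape. The helper name missing_services and the sample response are invented for illustration; no live Prometheus is contacted.

```python
EXPECTED_SERVICES = {"gateway", "auth-guard", "order-core", "seat", "queue"}


def missing_services(response, expected=EXPECTED_SERVICES, label="app"):
    """Return the expected services that have no series in a query response."""
    if response.get("status") != "success":
        raise ValueError("Prometheus query did not succeed")
    # Collect the distinct label values across all returned series.
    seen = {series["metric"].get(label) for series in response["data"]["result"]}
    return sorted(set(expected) - seen)


# Made-up response resembling the state before the fix: only gateway reports.
sample_response = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"app": "gateway", "namespace": "dev-webs"}, "value": [0, "42"]},
        ],
    },
}

print(missing_services(sample_response))
```

An empty list from this check corresponds to the expected outcome above, where every service appears under namespace="dev-webs".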

OTel vs Micrometer: Why OTel?

| Item | OTel | Micrometer |
|---|---|---|
| Standard | CNCF standard, vendor-neutral | Spring ecosystem |
| Dependency | OTel Agent only (runtime) | Requires micrometer-registry-prometheus |
| Configuration | Environment variables | application.yml |
| Metric names | OpenTelemetry Semantic Conventions | Spring Boot naming |

Conclusion: Since all services already have the OTel Agent attached, standardize on OTel metrics across the board.