Problem#

Java services in the EKS staging environment were failing to start and were stuck in CrashLoopBackOff. Log analysis showed that the OTEL (OpenTelemetry) Java agent could not connect to the OTEL Collector.

Environment#

  • Kubernetes: EKS 1.34
  • Istio: mTLS STRICT mode enabled
  • OTEL Agent: opentelemetry-javaagent v2.11.0
  • OTEL Collector: opentelemetry-collector (Deployment)
  • Namespace: staging-webs (apps), monitoring (OTEL Collector)

Symptoms#

Pod Status#

staging-webs  auth-guard-xxx   0/1   CrashLoopBackOff   17
staging-webs  order-core-xxx   0/1   CrashLoopBackOff   21
staging-webs  queue-xxx        0/1   CrashLoopBackOff   22
staging-webs  seat-xxx         0/1   CrashLoopBackOff   21

Error Logs#

[otel.javaagent] ERROR io.opentelemetry.exporter.internal.grpc.GrpcExporter -
Failed to export metrics. Server is UNAVAILABLE.
Make sure your collector is running and reachable from this network.
Full error message: upstream connect error or disconnect/reset before headers.
retried and the latest reset reason: remote connection failure,
transport failure reason: TLS_error:|268435703:SSL routines:OPENSSL_internal:WRONG_VERSION_NUMBER:TLS_error_end

Root Cause Analysis#

1. Normal Istio mTLS Traffic Flow#

For typical service-to-service communication:

┌─────────────────────────────────────────────────────────────────┐
│                    Normal mTLS Communication                     │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  [Pod A]                              [Pod B]                   │
│  ┌────────────┐                      ┌────────────┐             │
│  │  App A     │                      │  App B     │             │
│  │    │       │                      │    ↑       │             │
│  │    ↓       │                      │    │       │             │
│  │  Envoy     │ ════ mTLS ════════>  │  Envoy     │             │
│  │  Sidecar   │   (auto-encrypted)   │  Sidecar   │             │
│  └────────────┘                      └────────────┘             │
│                                                                 │
│  * App sends plain HTTP requests                                │
│  * Envoy sidecar handles mTLS automatically                     │
│  * Both sides have Istio sidecars - no issue                    │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

2. How the OTEL Agent Behaves Differently#

The OTEL Java agent connects directly from within the JVM:

┌─────────────────────────────────────────────────────────────────┐
│                    OTEL Agent Communication Problem              │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  [Java Pod]                          [OTEL Collector Pod]       │
│  ┌────────────────────┐              ┌────────────┐             │
│  │  ┌──────────────┐  │              │  Collector │             │
│  │  │  Spring App  │  │              │    ↑       │             │
│  │  │      +       │──┼─ HTTP ──────>│    │       │  ❌ FAIL    │
│  │  │  OTEL Agent  │  │   (direct)   │  Envoy     │             │
│  │  └──────────────┘  │              │  Sidecar   │             │
│  │         │          │              │  (mTLS     │             │
│  │         ↓          │              │   STRICT)  │             │
│  │     Envoy Sidecar  │              └────────────┘             │
│  │     (bypassed!)    │                                         │
│  └────────────────────┘                                         │
│                                                                 │
│  Issues:                                                        │
│  1. OTEL agent connects directly via gRPC/HTTP from within JVM  │
│  2. Does not go through (or incompletely goes through) sidecar  │
│  3. OTEL Collector's sidecar expects mTLS                       │
│  4. Result: TLS handshake failure                               │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

3. Interpreting the Error Message#

TLS_error:|268435703:SSL routines:OPENSSL_internal:WRONG_VERSION_NUMBER

| Code | Meaning |
| --- | --- |
| WRONG_VERSION_NUMBER | TLS protocol version mismatch |
| Cause 1 | Client sent plain HTTP, but server expects TLS |
| Cause 2 | Client sent TLS, but server expects plain HTTP |
| Cause 3 | TLS version mismatch (e.g. TLS 1.2 vs 1.3) |

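In this case, the OTEL agent exported over plaintext gRPC while the mesh expected mTLS, so one side of the connection spoke TLS and the other did not. That corresponds to Cause 1 or Cause 2 above, not to a genuine TLS version conflict.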
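When triaging similar failures, it can help to separate this specific TLS/plaintext mismatch from other export errors. The sketch below is a hypothetical helper (`classify_otel_error` is not part of kubectl, Istio, or OpenTelemetry) that reads log text on stdin and labels it using only the strings seen in the error log above.

```shell
# Hypothetical triage helper: reads log text on stdin and prints one label.
# The matched strings are taken from the error log in this post.
classify_otel_error() {
  log="$(cat)"
  case "$log" in
    *WRONG_VERSION_NUMBER*)    echo "tls-plaintext-mismatch" ;;
    *"Server is UNAVAILABLE"*) echo "collector-unreachable" ;;
    *)                         echo "no-otel-export-error" ;;
  esac
}
```

One possible use is to pipe Pod logs into it, for example `kubectl logs -n staging-webs -l app.kubernetes.io/name=auth-guard --tail=50 | classify_otel_error`.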


Fix#

Core Idea#

Add an mTLS exception for the OTEL Collector only (PERMISSIVE mode).

┌─────────────────────────────────────────────────────────────────┐
│                    Fixed Communication Flow                      │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  [Java Pod]                          [OTEL Collector Pod]       │
│  ┌────────────────────┐              ┌────────────┐             │
│  │  ┌──────────────┐  │              │  Collector │             │
│  │  │  Spring App  │  │              │    ↑       │             │
│  │  │      +       │──┼─ HTTP ──────>│    │       │  ✅ OK      │
│  │  │  OTEL Agent  │  │              │  Envoy     │             │
│  │  └──────────────┘  │              │  Sidecar   │             │
│  │                    │              │ (PERMISSIVE│             │
│  │                    │              │  = HTTP OK)│             │
│  └────────────────────┘              └────────────┘             │
│                                                                 │
│  PeerAuthentication: PERMISSIVE                                 │
│  → accepts both mTLS and plain HTTP                             │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

1. Add DestinationRule#

Disable mTLS for client (app) → OTEL Collector connections

File: common-charts/infra/monitoring/otel-collector/templates/destinationrule.yaml

# mTLS exception - DISABLE required since OTEL agent connects directly
apiVersion: networking.istio.io/v1
kind: DestinationRule
metadata:
  name: otel-collector-mtls-disable
  namespace: {{ .Release.Namespace }}
spec:
  host: otel-collector.{{ .Release.Namespace }}.svc.cluster.local
  trafficPolicy:
    tls:
      mode: DISABLE

Role:

  • Instructs the client Envoy sidecar: “don’t use mTLS when connecting to the OTEL Collector”
  • Traffic is sent without encryption

2. Add PeerAuthentication#

Allow the OTEL Collector to accept plain HTTP connections

File: common-charts/infra/monitoring/otel-collector/templates/peerauthentication.yaml

# OTEL Collector - mTLS PERMISSIVE setting
apiVersion: security.istio.io/v1
kind: PeerAuthentication
metadata:
  name: otel-collector
  namespace: {{ .Release.Namespace }}
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: opentelemetry-collector
  mtls:
    mode: PERMISSIVE
  portLevelMtls:
    # gRPC port - the port OTEL agent connects to
    4317:
      mode: PERMISSIVE
    # HTTP port
    4318:
      mode: PERMISSIVE

Role:

  • Instructs the OTEL Collector’s Envoy sidecar: “accept both mTLS and plain HTTP”
  • Per-port configuration is possible (4317: gRPC, 4318: HTTP)

Istio mTLS Mode Comparison#

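| Mode | Description | Use Case |
| --- | --- | --- |
| STRICT | mTLS only | General service-to-service communication (the mesh-wide setting here; Istio's out-of-the-box default is PERMISSIVE) |
| PERMISSIVE | Accept both mTLS and plain HTTP | Migration, special clients |
| DISABLE | mTLS disabled | External services, legacy systems |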

DestinationRule vs PeerAuthentication#

| Resource | Applies To | Role |
| --- | --- | --- |
| DestinationRule | Client (sender) | “How to connect when going to this service” |
| PeerAuthentication | Server (receiver) | “What connections to accept” |

Why both are needed:

  1. DestinationRule only: client sends plain HTTP, but server rejects it
  2. PeerAuthentication only: server is ready, but client still tries mTLS
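
The client/server pairing can be sketched as a small lookup. This is a simplified model for reasoning about the two resources, not an Istio API: `mtls_outcome` is a hypothetical helper, its first argument stands for the DestinationRule TLS mode on the client side, and its second for the PeerAuthentication mode on the server side.

```shell
# Toy model of how the client and server settings pair up (assumption: both
# ends are represented only by their configured mode).
#   client: ISTIO_MUTUAL (sidecar originates mTLS) or DISABLE (plaintext)
#   server: STRICT, PERMISSIVE, or DISABLE
mtls_outcome() {
  client="$1"
  server="$2"
  case "$client:$server" in
    ISTIO_MUTUAL:STRICT|ISTIO_MUTUAL:PERMISSIVE) echo "ok (mTLS)" ;;
    DISABLE:PERMISSIVE|DISABLE:DISABLE)          echo "ok (plaintext)" ;;
    DISABLE:STRICT)                              echo "fail (server rejects plaintext)" ;;
    ISTIO_MUTUAL:DISABLE)                        echo "fail (server does not terminate mTLS)" ;;
    *)                                           echo "unknown mode"; return 1 ;;
  esac
}
```

For example, `mtls_outcome DISABLE STRICT` reports a failure, which is the DestinationRule-only case in the list above, while `mtls_outcome DISABLE PERMISSIVE` is the combination this fix deploys.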

Traffic Policy After Fix#

| Traffic Path | mTLS Setting |
| --- | --- |
| App ↔ App (within staging-webs) | STRICT (unchanged) |
| App → OTEL Collector | DISABLE/PERMISSIVE (exception) |
| Prometheus → App metrics | STRICT (unchanged) |
| External → Istio Gateway | TLS (certificate) |

Deployment Commands#

1. Commit & push changes#

cd /Users/wonny/Documents/GitHub/303-goormgb-k8s-helm
git add -A
git commit -m "fix(otel): add mTLS exception for otel-collector"
git push

2. ArgoCD Sync (Bastion)#

# Sync OTEL Collector (apply mTLS exception)
argocd app sync staging-otel-collector --force

# Verify
kubectl get destinationrule -n monitoring
kubectl get peerauthentication -n monitoring

3. Restart app services#

# Restart all deployments
kubectl rollout restart deployment -n staging-webs

# Check status
kubectl get pods -n staging-webs -w

Verification#

1. Check Pod status#

kubectl get pods -n staging-webs
# All Pods should be Running with READY 2/2
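
With many Pods, scanning the READY column by eye is error-prone. The sketch below filters the listing down to Pods that are not yet healthy; `not_ready_pods` is a hypothetical helper name, and it assumes the default `kubectl get pods` column order (NAME, READY, STATUS, ...).

```shell
# Reads `kubectl get pods --no-headers` output on stdin and prints the name of
# every Pod that is not Running with all of its containers ready.
not_ready_pods() {
  awk '{
    split($2, ready, "/")   # READY column, e.g. "1/2"
    if ($3 != "Running" || ready[1] != ready[2]) print $1
  }'
}
```

A possible invocation is `kubectl get pods -n staging-webs --no-headers | not_ready_pods`; empty output means every listed Pod is Running and fully ready.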

2. Check OTEL Collector connection#

# No OTEL errors should appear in Pod logs
kubectl logs -n staging-webs -l app.kubernetes.io/name=auth-guard --tail=50 | grep -i otel

3. Verify trace collection#

Grafana → Tempo → check service traces


Alternative Approaches#

Option 1: Remove OTEL Collector sidecar#

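# Add the annotation to the OTEL Collector Deployment's pod template
# (it is read from the Pod, not from the Deployment's own metadata)
spec:
  template:
    metadata:
      annotations:
        sidecar.istio.io/inject: "false"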

Pros: Simple
Cons: OTEL Collector loses mTLS protection

Option 2: OTEL Agent uses TLS#

# Change OTEL endpoint in app values
otel:
  collectorEndpoint: "https://otel-collector.monitoring.svc.cluster.local:4317"
  # + TLS certificate configuration required

Pros: Maintains security
Cons: Complex certificate management
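
As a rough idea of what the certificate configuration for this option could involve, the sketch below sets the agent's OTLP exporter through the standard OpenTelemetry environment variables. The certificate path is a hypothetical mount point, and the sketch assumes the Collector itself is configured to serve TLS on port 4317, which this post does not cover.

```shell
# Standard OpenTelemetry OTLP exporter environment variables, read by the Java agent.
# Assumption: the Collector serves TLS on 4317 and its CA certificate is mounted
# into the app container at the (hypothetical) path below.
export OTEL_EXPORTER_OTLP_ENDPOINT="https://otel-collector.monitoring.svc.cluster.local:4317"
export OTEL_EXPORTER_OTLP_PROTOCOL="grpc"
export OTEL_EXPORTER_OTLP_CERTIFICATE="/etc/otel/certs/ca.crt"
```

In a Helm chart these would normally be set as container `env` entries, not exported in a shell. The certificate also has to be issued, mounted, and rotated, which is where the complexity noted above comes from.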

Option 3: Current approach (PERMISSIVE)#

Pros:

  • Minimal changes to existing configuration
  • Only the OTEL Collector is excepted
  • All other services remain STRICT

Cons:

  • Traffic to the OTEL Collector is unencrypted
  • (Judged acceptable here because the traffic stays inside the cluster network)

| File | Role |
| --- | --- |
| staging/values/values-istio-security.yaml | Global mTLS settings |
| common-charts/infra/monitoring/otel-collector/templates/destinationrule.yaml | DestinationRule for OTEL |
| common-charts/infra/monitoring/otel-collector/templates/peerauthentication.yaml | PeerAuthentication for OTEL |
| common-charts/apps/java-service/templates/deployment.yaml | OTEL agent configuration |
| common-charts/apps/java-service/values.yaml | OTEL endpoint default values |

Lessons Learned#

  1. Istio mTLS is sidecar-based: traffic that bypasses the sidecar can cause issues
  2. OTEL agent is a special case: it connects directly from within the JVM, bypassing the sidecar
  3. PERMISSIVE is for migration/exceptions: STRICT is recommended in general; use exceptions only for special cases
  4. Both DestinationRule and PeerAuthentication are needed: configure both the client and server sides
