OpenTelemetry + Istio mTLS Troubleshooting
Problem#
Java services in the EKS staging environment were failing to start and were stuck in CrashLoopBackOff. Log analysis showed that the OTEL (OpenTelemetry) Java agent could not connect to the OTEL Collector.
Environment#
- Kubernetes: EKS 1.34
- Istio: mTLS STRICT mode enabled
- OTEL Agent: opentelemetry-javaagent v2.11.0
- OTEL Collector: opentelemetry-collector (Deployment)
- Namespace: staging-webs (apps), monitoring (OTEL Collector)
Symptoms#
Pod Status#
staging-webs auth-guard-xxx 0/1 CrashLoopBackOff 17
staging-webs order-core-xxx 0/1 CrashLoopBackOff 21
staging-webs queue-xxx 0/1 CrashLoopBackOff 22
staging-webs seat-xxx 0/1 CrashLoopBackOff 21
Error Logs#
[otel.javaagent] ERROR io.opentelemetry.exporter.internal.grpc.GrpcExporter -
Failed to export metrics. Server is UNAVAILABLE.
Make sure your collector is running and reachable from this network.
Full error message: upstream connect error or disconnect/reset before headers.
retried and the latest reset reason: remote connection failure,
transport failure reason: TLS_error:|268435703:SSL routines:OPENSSL_internal:WRONG_VERSION_NUMBER:TLS_error_end
Root Cause Analysis#
1. Normal Istio mTLS Traffic Flow#
For typical service-to-service communication:
┌─────────────────────────────────────────────────────────────────┐
│ Normal mTLS Communication │
├─────────────────────────────────────────────────────────────────┤
│ │
│ [Pod A] [Pod B] │
│ ┌────────────┐ ┌────────────┐ │
│ │ App A │ │ App B │ │
│ │ │ │ │ ↑ │ │
│ │ ↓ │ │ │ │ │
│ │ Envoy │ ════ mTLS ════════> │ Envoy │ │
│ │ Sidecar │ (auto-encrypted) │ Sidecar │ │
│ └────────────┘ └────────────┘ │
│ │
│ * App sends plain HTTP requests │
│ * Envoy sidecar handles mTLS automatically │
│ * Both sides have Istio sidecars - no issue │
│ │
└─────────────────────────────────────────────────────────────────┘
2. How the OTEL Agent Behaves Differently#
The OTEL Java agent connects directly from within the JVM:
┌─────────────────────────────────────────────────────────────────┐
│ OTEL Agent Communication Problem │
├─────────────────────────────────────────────────────────────────┤
│ │
│ [Java Pod] [OTEL Collector Pod] │
│ ┌────────────────────┐ ┌────────────┐ │
│ │ ┌──────────────┐ │ │ Collector │ │
│ │ │ Spring App │ │ │ ↑ │ │
│ │ │ + │──┼─ HTTP ──────>│ │ │ ❌ FAIL │
│ │ │ OTEL Agent │ │ (direct) │ Envoy │ │
│ │ └──────────────┘ │ │ Sidecar │ │
│ │ │ │ │ (mTLS │ │
│ │ ↓ │ │ STRICT) │ │
│ │ Envoy Sidecar │ └────────────┘ │
│ │ (bypassed!) │ │
│ └────────────────────┘ │
│ │
│ Issues: │
│ 1. OTEL agent connects directly via gRPC/HTTP from within JVM │
│ 2. Does not go through (or incompletely goes through) sidecar │
│ 3. OTEL Collector's sidecar expects mTLS │
│ 4. Result: TLS handshake failure │
│ │
└─────────────────────────────────────────────────────────────────┘
3. Interpreting the Error Message#
TLS_error:|268435703:SSL routines:OPENSSL_internal:WRONG_VERSION_NUMBER
| Item | Meaning |
|---|---|
| WRONG_VERSION_NUMBER | The version field of the received record was not a recognized TLS version |
| Cause 1 | Client sent plain HTTP, but server expects TLS |
| Cause 2 | Client sent TLS, but server expects plain HTTP |
| Cause 3 | TLS version mismatch (e.g. TLS 1.2 vs 1.3) |
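In this case, the OTEL agent sent plaintext gRPC (HTTP/2 without TLS) while another hop on the path was speaking TLS under Istio mTLS, so one side received bytes that were not a valid TLS record. The "upstream connect error" wording in the log is generated by Envoy, which indicates that a sidecar was involved in the failed connection.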
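To make the plaintext-versus-TLS mismatch concrete, the sketch below classifies a byte stream by its first three bytes. A TLS record starts with a content-type byte (0x16 for a handshake) followed by a two-byte version beginning with 0x03, while plaintext gRPC starts with the HTTP/2 connection preface `PRI * HTTP/2.0`. A TLS parser that is handed the preface finds `R` and `I` where it expects the version, which is the kind of input that ends in WRONG_VERSION_NUMBER. The function name `classify_stream` is hypothetical, and the check is deliberately minimal: it is an illustration, not how Envoy or BoringSSL is implemented.

```shell
#!/usr/bin/env bash
# classify_stream (hypothetical helper): read bytes on stdin and report
# what protocol the stream appears to start with.
classify_stream() {
  local hex
  # First 3 bytes as hex without separators, e.g. "160301".
  hex=$(head -c 3 | od -An -tx1 | tr -d ' \n')
  case "$hex" in
    1603??) echo "tls" ;;    # 0x16 = handshake record, 0x03xx = TLS version
    505249) echo "h2c" ;;    # "PRI", start of the HTTP/2 cleartext preface
    *)      echo "other" ;;
  esac
}
```

For example, `printf 'PRI * HTTP/2.0\r\n\r\nSM\r\n\r\n' | classify_stream` reports `h2c`, which is what a plaintext gRPC client puts on the wire first.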
Fix#
Core Idea#
Add an mTLS exception for the OTEL Collector only (PERMISSIVE mode)
┌─────────────────────────────────────────────────────────────────┐
│ Fixed Communication Flow │
├─────────────────────────────────────────────────────────────────┤
│ │
│ [Java Pod] [OTEL Collector Pod] │
│ ┌────────────────────┐ ┌────────────┐ │
│ │ ┌──────────────┐ │ │ Collector │ │
│ │ │ Spring App │ │ │ ↑ │ │
│ │ │ + │──┼─ HTTP ──────>│ │ │ ✅ OK │
│ │ │ OTEL Agent │ │ │ Envoy │ │
│ │ └──────────────┘ │ │ Sidecar │ │
│ │ │ │ (PERMISSIVE│ │
│ │ │ │ = HTTP OK)│ │
│ └────────────────────┘ └────────────┘ │
│ │
│ PeerAuthentication: PERMISSIVE │
│ → accepts both mTLS and plain HTTP │
│ │
└─────────────────────────────────────────────────────────────────┘
1. Add DestinationRule#
Disable mTLS for client (app) → OTEL Collector connections
File: common-charts/infra/monitoring/otel-collector/templates/destinationrule.yaml
# mTLS exception - DISABLE required since OTEL agent connects directly
apiVersion: networking.istio.io/v1
kind: DestinationRule
metadata:
  name: otel-collector-mtls-disable
  namespace: {{ .Release.Namespace }}
spec:
  host: otel-collector.{{ .Release.Namespace }}.svc.cluster.local
  trafficPolicy:
    tls:
      mode: DISABLE
Role:
- Instructs the client Envoy sidecar: “don’t use mTLS when connecting to the OTEL Collector”
- Traffic is sent without encryption
2. Add PeerAuthentication#
Allow the OTEL Collector to accept plain HTTP connections
File: common-charts/infra/monitoring/otel-collector/templates/peerauthentication.yaml
# OTEL Collector - mTLS PERMISSIVE setting
apiVersion: security.istio.io/v1
kind: PeerAuthentication
metadata:
  name: otel-collector
  namespace: {{ .Release.Namespace }}
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: opentelemetry-collector
  mtls:
    mode: PERMISSIVE
  portLevelMtls:
    # gRPC port - the port OTEL agent connects to
    4317:
      mode: PERMISSIVE
    # HTTP port
    4318:
      mode: PERMISSIVE
Role:
- Instructs the OTEL Collector’s Envoy sidecar: “accept both mTLS and plain HTTP”
- Per-port configuration is possible (4317: gRPC, 4318: HTTP)
Istio mTLS Mode Comparison#
| Mode | Description | Use Case |
|---|---|---|
| STRICT | mTLS only | General service-to-service communication (the mesh-wide setting in this environment; Istio itself defaults to PERMISSIVE) |
| PERMISSIVE | Accept both mTLS and plain HTTP | Migration, special clients |
| DISABLE | mTLS disabled | External services, legacy systems |
DestinationRule vs PeerAuthentication#
| Resource | Applies To | Role |
|---|---|---|
| DestinationRule | Client (sender) | “How to connect when going to this service” |
| PeerAuthentication | Server (receiver) | “What connections to accept” |
Why both are needed:
- DestinationRule only: client sends plain HTTP, but server rejects it
- PeerAuthentication only: server is ready, but client still tries mTLS
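The sender/receiver pairing can be sketched as a small lookup. This is a simplified model under stated assumptions, not Istio's actual policy evaluation: the sender either originates Istio mTLS or plaintext (the latter is what `tls.mode: DISABLE` in a DestinationRule requests), and only the STRICT and PERMISSIVE receiver modes used in this write-up are covered. The function name `mtls_outcome` is hypothetical.

```shell
#!/usr/bin/env bash
# mtls_outcome SENDER RECEIVER (hypothetical helper, simplified model)
#   SENDER:   what the client side originates: "mtls" or "plaintext"
#   RECEIVER: PeerAuthentication mode on the server: STRICT or PERMISSIVE
mtls_outcome() {
  local sender="$1" receiver="$2"
  case "$receiver:$sender" in
    STRICT:mtls|PERMISSIVE:mtls|PERMISSIVE:plaintext)
      echo "accepted" ;;
    STRICT:plaintext)
      echo "rejected" ;;   # the DestinationRule-only case
    *)
      echo "unknown"; return 2 ;;
  esac
}
```

`mtls_outcome plaintext STRICT` prints `rejected`, matching the DestinationRule-only case, and `mtls_outcome plaintext PERMISSIVE` prints `accepted`, which is the combination this fix configures.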
Traffic Policy After Fix#
| Traffic Path | mTLS Setting |
|---|---|
| App ↔ App (within staging-webs) | STRICT (unchanged) |
| App → OTEL Collector | DISABLE on the client side (DestinationRule), PERMISSIVE on the server side (PeerAuthentication) |
| Prometheus → App metrics | STRICT (unchanged) |
| External → Istio Gateway | TLS (certificate) |
Deployment Commands#
1. Commit & push changes#
cd /Users/wonny/Documents/GitHub/303-goormgb-k8s-helm
git add -A
git commit -m "fix(otel): add mTLS exception for otel-collector"
git push
2. ArgoCD Sync (Bastion)#
# Sync OTEL Collector (apply mTLS exception)
argocd app sync staging-otel-collector --force
# Verify
kubectl get destinationrule -n monitoring
kubectl get peerauthentication -n monitoring
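Listing the resources confirms that they exist but not what they contain. A minimal sketch of a stricter check reads the configured modes back with `-o jsonpath`. It assumes the resource names from the manifests above (`otel-collector-mtls-disable` and `otel-collector`) and the `monitoring` namespace; the function name `check_otel_mtls_exception` is hypothetical.

```shell
#!/usr/bin/env bash
# check_otel_mtls_exception [NAMESPACE] (hypothetical helper)
# Prints the configured modes and succeeds only if both match the fix.
check_otel_mtls_exception() {
  local ns="${1:-monitoring}" dr_mode pa_mode
  # Client-side setting from the DestinationRule.
  dr_mode=$(kubectl get destinationrule otel-collector-mtls-disable -n "$ns" \
    -o jsonpath='{.spec.trafficPolicy.tls.mode}')
  # Server-side workload-level setting from the PeerAuthentication.
  pa_mode=$(kubectl get peerauthentication otel-collector -n "$ns" \
    -o jsonpath='{.spec.mtls.mode}')
  echo "DestinationRule tls.mode=${dr_mode:-<missing>} PeerAuthentication mtls.mode=${pa_mode:-<missing>}"
  [ "$dr_mode" = "DISABLE" ] && [ "$pa_mode" = "PERMISSIVE" ]
}
```

Run `check_otel_mtls_exception monitoring` after the sync. A non-zero exit status means at least one resource is missing or carries a different mode than expected.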
3. Restart app services#
# Restart all deployments
kubectl rollout restart deployment -n staging-webs
# Check status
kubectl get pods -n staging-webs -w
Verification#
1. Check Pod status#
kubectl get pods -n staging-webs
# All Pods should be Running with READY 2/2
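With many Pods it is easier to list only the ones that fail this check. The filter below is a sketch that assumes the column layout of `kubectl get pods -n <namespace> --no-headers` (NAME, READY, STATUS, ...). The function name `not_ready_pods` is hypothetical. Pods in a terminal state such as Completed are also reported, since they are not Running.

```shell
#!/usr/bin/env bash
# not_ready_pods (hypothetical helper): read `kubectl get pods --no-headers`
# output on stdin and print Pods that are not Running or not fully ready.
not_ready_pods() {
  awk '{
    split($2, r, "/")   # READY column, e.g. "1/2" -> r[1]="1", r[2]="2"
    if ($3 != "Running" || r[1] != r[2]) print $1, $2, $3
  }'
}
```

Usage: `kubectl get pods -n staging-webs --no-headers | not_ready_pods`. Empty output means every Pod is Running with all containers ready.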
2. Check OTEL Collector connection#
# No OTEL errors should appear in Pod logs
kubectl logs -n staging-webs -l app.kubernetes.io/name=auth-guard --tail=50 | grep -i otel
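`grep -i otel` also matches harmless agent startup lines. A narrower sketch filters for the specific strings from the error log in the Symptoms section. The function name `otel_export_errors` is hypothetical, and the pattern list only covers the messages seen in this incident.

```shell
#!/usr/bin/env bash
# otel_export_errors (hypothetical helper): print only the log lines that
# match the failure signatures seen in this incident.
# Exit status is 0 if any line matched, 1 if none did (grep semantics).
otel_export_errors() {
  grep -E 'Failed to export|WRONG_VERSION_NUMBER|upstream connect error'
}
```

Usage: `kubectl logs -n staging-webs -l app.kubernetes.io/name=auth-guard --tail=50 | otel_export_errors`. No output and exit status 1 mean none of the known failure messages appear in the sampled lines.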
3. Verify trace collection#
Grafana → Tempo → check service traces
Alternative Approaches#
Option 1: Remove OTEL Collector sidecar#
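# Add to the pod template of the OTEL Collector Deployment
# (injection is decided per pod, so Deployment-level metadata has no effect)
spec:
  template:
    metadata:
      annotations:
        sidecar.istio.io/inject: "false"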
Pros: Simple
Cons: OTEL Collector loses mTLS protection
Option 2: OTEL Agent uses TLS#
# Change OTEL endpoint in app values
otel:
  collectorEndpoint: "https://otel-collector.monitoring.svc.cluster.local:4317"
# + TLS certificate configuration required
Pros: Maintains security
Cons: Complex certificate management
Option 3: Current approach (PERMISSIVE)#
Pros:
- Minimal changes to existing configuration
- Only the OTEL Collector is excepted
- All other services remain STRICT
Cons:
- Traffic to the OTEL Collector is unencrypted
- (Acceptable since it’s internal network traffic)
Related Files#
| File | Role |
|---|---|
| staging/values/values-istio-security.yaml | Global mTLS settings |
| common-charts/infra/monitoring/otel-collector/templates/destinationrule.yaml | DestinationRule for OTEL |
| common-charts/infra/monitoring/otel-collector/templates/peerauthentication.yaml | PeerAuthentication for OTEL |
| common-charts/apps/java-service/templates/deployment.yaml | OTEL agent configuration |
| common-charts/apps/java-service/values.yaml | OTEL endpoint default values |
Lessons Learned#
- Istio mTLS is sidecar-based: traffic that bypasses the sidecar can cause issues
- OTEL agent is a special case: it connects directly from within the JVM, bypassing the sidecar
- PERMISSIVE is for migration/exceptions: STRICT is recommended in general; use exceptions only for special cases
- Both DestinationRule and PeerAuthentication are needed: configure both the client and server sides