<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Troubleshooting on blog.212clab</title><link>https://212clab.pages.dev/en/troubleshooting/</link><description>Recent content in Troubleshooting on blog.212clab</description><generator>Hugo</generator><language>en</language><lastBuildDate>Mon, 20 Apr 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://212clab.pages.dev/en/troubleshooting/index.xml" rel="self" type="application/rss+xml"/><item><title>Environment-Specific Monitoring Dashboard Separation</title><link>https://212clab.pages.dev/en/troubleshooting/2026-04-20-monitoring-dashboard-env-split/</link><pubDate>Mon, 20 Apr 2026 00:00:00 +0000</pubDate><guid>https://212clab.pages.dev/en/troubleshooting/2026-04-20-monitoring-dashboard-env-split/</guid><description>&lt;h2 id="problem"&gt;Problem&lt;/h2&gt;
&lt;p&gt;Monitoring dashboards were broken because the infrastructure setup differs between the Dev (kubeadm) and Staging (EKS) environments.&lt;/p&gt;
&lt;h3 id="infrastructure-differences-by-environment"&gt;Infrastructure differences by environment&lt;/h3&gt;
&lt;table&gt;
 &lt;thead&gt;
 &lt;tr&gt;
 &lt;th&gt;Component&lt;/th&gt;
 &lt;th&gt;Dev (kubeadm)&lt;/th&gt;
 &lt;th&gt;Staging (EKS)&lt;/th&gt;
 &lt;/tr&gt;
 &lt;/thead&gt;
 &lt;tbody&gt;
 &lt;tr&gt;
 &lt;td&gt;&lt;strong&gt;PostgreSQL&lt;/strong&gt;&lt;/td&gt;
 &lt;td&gt;Local Pod (postgres:16-alpine)&lt;/td&gt;
 &lt;td&gt;AWS RDS&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;&lt;strong&gt;Redis&lt;/strong&gt;&lt;/td&gt;
 &lt;td&gt;Local Pod (redis:7-alpine)&lt;/td&gt;
 &lt;td&gt;AWS ElastiCache&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;&lt;strong&gt;Metrics collection&lt;/strong&gt;&lt;/td&gt;
 &lt;td&gt;Prometheus exporter&lt;/td&gt;
 &lt;td&gt;CloudWatch&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;&lt;strong&gt;Data source&lt;/strong&gt;&lt;/td&gt;
 &lt;td&gt;Prometheus&lt;/td&gt;
 &lt;td&gt;CloudWatch&lt;/td&gt;
 &lt;/tr&gt;
 &lt;/tbody&gt;
&lt;/table&gt;
&lt;h3 id="symptoms"&gt;Symptoms&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&amp;ldquo;RDS PostgreSQL Monitoring&amp;rdquo; dashboard in Dev → &lt;strong&gt;No data&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&amp;ldquo;ElastiCache Redis Monitoring&amp;rdquo; dashboard in Dev → &lt;strong&gt;No data&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Cause: no CloudWatch data source (not an AWS environment)&lt;/li&gt;
&lt;/ul&gt;
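&lt;p&gt;Dev has no CloudWatch data source, so any panel bound to one can only show No data. The fix below keeps one version of each dashboard per environment; with Grafana&amp;rsquo;s dashboard sidecar (an assumption about this setup, not confirmed by the post), each version ships as a labeled ConfigMap, roughly:&lt;/p&gt;
&lt;pre tabindex="0"&gt;&lt;code class="language-yaml" data-lang="yaml"&gt;# sketch: names and the sidecar label convention are assumptions
apiVersion: v1
kind: ConfigMap
metadata:
  name: postgres-dashboard-dev
  namespace: monitoring
  labels:
    grafana_dashboard: &amp;#34;1&amp;#34;   # picked up by the Grafana dashboard sidecar
data:
  postgres-dev.json: |
    { &amp;#34;title&amp;#34;: &amp;#34;PostgreSQL Monitoring (Dev)&amp;#34;, &amp;#34;panels&amp;#34;: [] }
&lt;/code&gt;&lt;/pre&gt;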
&lt;hr&gt;
&lt;h2 id="fix"&gt;Fix&lt;/h2&gt;
&lt;h3 id="1-separate-dashboards-by-environment"&gt;1. Separate dashboards by environment&lt;/h3&gt;
&lt;p&gt;Created two versions of the same dashboard:&lt;/p&gt;</description></item><item><title>Monitoring Data S3 Object Storage Architecture</title><link>https://212clab.pages.dev/en/troubleshooting/2026-04-20-monitoring-s3-object-storage/</link><pubDate>Mon, 20 Apr 2026 00:00:00 +0000</pubDate><guid>https://212clab.pages.dev/en/troubleshooting/2026-04-20-monitoring-s3-object-storage/</guid><description>&lt;h2 id="problem"&gt;Problem&lt;/h2&gt;
&lt;p&gt;When Spot instances were reclaimed in the Staging environment and new nodes were provisioned in a different AZ, monitoring pods (Loki, Tempo, etc.) got stuck in Pending because they couldn&amp;rsquo;t mount their existing PVCs.&lt;/p&gt;
&lt;h3 id="symptoms"&gt;Symptoms&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Spot node reclaimed → new node provisioned in a different AZ (e.g., ap-northeast-2a → 2c)&lt;/li&gt;
&lt;li&gt;Monitoring pods rescheduled → existing PVC (EBS) was bound to the previous AZ, mount failed&lt;/li&gt;
&lt;li&gt;Pod: &lt;code&gt;Pending&lt;/code&gt; → &lt;code&gt;Warning: FailedAttachVolume&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Log and trace collection interrupted&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="root-cause-ebs-is-bound-to-a-single-az"&gt;Root cause: EBS is bound to a single AZ&lt;/h3&gt;
&lt;pre tabindex="0"&gt;&lt;code&gt;[Before]
Node (ap-northeast-2a) ← Spot
└── Loki Pod
    └── PVC → EBS (pinned to ap-northeast-2a)

[After Spot reclaim]
Node (ap-northeast-2c) ← newly provisioned
└── Loki Pod (Pending)
    └── PVC → EBS (ap-northeast-2a) ← cannot be mounted!
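
[Fix direction, per the title: S3 object storage]
Node (any AZ)
└── Loki/Tempo Pod
    └── S3 bucket (AZ-independent; no EBS PVC to pin)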
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;EBS volumes can only be used in the AZ where they were created. Because Spot instances can be provisioned in any available AZ, AZ mismatches occurred frequently.&lt;/p&gt;</description></item><item><title>OTel Metrics Dashboard Troubleshooting</title><link>https://212clab.pages.dev/en/troubleshooting/2026-04-20-otel-dashboard-gateway/</link><pubDate>Mon, 20 Apr 2026 00:00:00 +0000</pubDate><guid>https://212clab.pages.dev/en/troubleshooting/2026-04-20-otel-dashboard-gateway/</guid><description>&lt;h2 id="problem"&gt;Problem&lt;/h2&gt;
&lt;p&gt;In the Grafana &amp;ldquo;Application Monitoring (Spring Boot)&amp;rdquo; dashboard:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Only gateway appeared in the application dropdown&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Other services (auth-guard, order-core, seat, queue, etc.) were missing&lt;/li&gt;
&lt;li&gt;Querying &lt;code&gt;http_server_requests_seconds_count{namespace=&amp;quot;dev-webs&amp;quot;}&lt;/code&gt; only returned results for gateway&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="root-cause-analysis"&gt;Root Cause Analysis&lt;/h2&gt;
&lt;h3 id="1-two-separate-metric-collection-paths"&gt;1. Two separate metric collection paths&lt;/h3&gt;
&lt;table&gt;
 &lt;thead&gt;
 &lt;tr&gt;
 &lt;th&gt;Path&lt;/th&gt;
 &lt;th&gt;Metric name&lt;/th&gt;
 &lt;th&gt;Labels&lt;/th&gt;
 &lt;th&gt;Targets&lt;/th&gt;
 &lt;/tr&gt;
 &lt;/thead&gt;
 &lt;tbody&gt;
 &lt;tr&gt;
 &lt;td&gt;PodMonitor → Prometheus direct scrape&lt;/td&gt;
 &lt;td&gt;&lt;code&gt;http_server_requests_seconds_*&lt;/code&gt;&lt;/td&gt;
 &lt;td&gt;&lt;code&gt;status&lt;/code&gt;, &lt;code&gt;uri&lt;/code&gt;, &lt;code&gt;namespace&lt;/code&gt;&lt;/td&gt;
 &lt;td&gt;gateway only&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;OTel Agent → OTel Collector → Remote Write&lt;/td&gt;
 &lt;td&gt;&lt;code&gt;http_server_request_duration_seconds_*&lt;/code&gt;&lt;/td&gt;
 &lt;td&gt;&lt;code&gt;http_response_status_code&lt;/code&gt;, &lt;code&gt;http_route&lt;/code&gt;&lt;/td&gt;
 &lt;td&gt;all services&lt;/td&gt;
 &lt;/tr&gt;
 &lt;/tbody&gt;
&lt;/table&gt;
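&lt;p&gt;Path 1 is driven by a PodMonitor. A minimal sketch (selector and port names are illustrative, not the actual manifest):&lt;/p&gt;
&lt;pre tabindex="0"&gt;&lt;code class="language-yaml" data-lang="yaml"&gt;apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: gateway
  namespace: dev-webs
spec:
  selector:
    matchLabels:
      app: gateway                # assumption: label on the gateway pods
  podMetricsEndpoints:
    - port: http                  # assumption: named container port
      path: /actuator/prometheus  # Micrometer endpoint, hence gateway-only
&lt;/code&gt;&lt;/pre&gt;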
&lt;h3 id="2-why-only-gateway-had-micrometer-metrics"&gt;2. Why only gateway had Micrometer metrics&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;gateway: Spring Cloud Gateway with &lt;code&gt;micrometer-registry-prometheus&lt;/code&gt; dependency included&lt;/li&gt;
&lt;li&gt;Other services: OTel Java Agent only, no Micrometer dependency&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="3-missing-namespace-label-in-otel-metrics"&gt;3. Missing namespace label in OTel metrics&lt;/h3&gt;
&lt;p&gt;The OTel Collector&amp;rsquo;s transform processor was configured to set the &lt;code&gt;namespace&lt;/code&gt; label:&lt;/p&gt;</description></item><item><title>CloudTrail + S3 Bucket Policy Circular Dependency</title><link>https://212clab.pages.dev/en/troubleshooting/2026-04-13-cloudtrail-s3-circular-dependency/</link><pubDate>Mon, 13 Apr 2026 00:00:00 +0000</pubDate><guid>https://212clab.pages.dev/en/troubleshooting/2026-04-13-cloudtrail-s3-circular-dependency/</guid><description>&lt;h2 id="problem"&gt;Problem&lt;/h2&gt;
&lt;p&gt;Running &lt;code&gt;terraform apply&lt;/code&gt; on &lt;code&gt;stacks/audit-security/&lt;/code&gt; to enable CloudTrail failed with the error below.&lt;/p&gt;
&lt;h3 id="error"&gt;Error&lt;/h3&gt;
&lt;pre tabindex="0"&gt;&lt;code&gt;Error: creating CloudTrail Trail (playball-audit-trail): InsufficientS3BucketPolicyException: 
Incorrect S3 bucket policy is detected for bucket: playball-audit-logs
&lt;/code&gt;&lt;/pre&gt;&lt;h2 id="root-cause-analysis"&gt;Root Cause Analysis&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Circular Dependency:&lt;/strong&gt;&lt;/p&gt;
&lt;pre tabindex="0"&gt;&lt;code&gt;Creating CloudTrail → requires S3 bucket policy with write permission for CloudTrail
        ↕
S3 bucket policy → needs to reference the CloudTrail ARN (module.cloudtrail.source_arn)
        ↕
CloudTrail ARN → only exists after CloudTrail is created
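
[One way to break the cycle]
S3 bucket policy → build the trail ARN from parts known up front, not the module output:
  arn:aws:cloudtrail:&amp;lt;region&amp;gt;:&amp;lt;account-id&amp;gt;:trail/playball-audit-trail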
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;The code used a &lt;code&gt;dynamic &amp;quot;statement&amp;quot;&lt;/code&gt; to add the CloudTrail permission only when &lt;code&gt;module.cloudtrail.source_arn != null&lt;/code&gt;. But since CloudTrail hasn&amp;rsquo;t been created yet, &lt;code&gt;source_arn = null&lt;/code&gt; → dynamic block is skipped → bucket policy has no CloudTrail permission → CloudTrail creation fails.&lt;/p&gt;</description></item><item><title>Route53 Record Deletion + Secrets Manager Reset</title><link>https://212clab.pages.dev/en/troubleshooting/2026-04-10-external-dns-sync-secrets/</link><pubDate>Fri, 10 Apr 2026 00:00:00 +0000</pubDate><guid>https://212clab.pages.dev/en/troubleshooting/2026-04-10-external-dns-sync-secrets/</guid><description>&lt;blockquote&gt;
&lt;p&gt;2026-04-10 | Route53 deletion and Secret reset triggered by a team member&amp;rsquo;s terraform apply&lt;/p&gt;
&lt;/blockquote&gt;
&lt;hr&gt;
&lt;h2 id="problem"&gt;Problem&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Route53 records deleted&lt;/strong&gt;: DNS records disappeared after EKS restart/scale-down&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Secrets Manager reset&lt;/strong&gt;: DB/Redis passwords and other secrets were overwritten on &lt;code&gt;terraform apply&lt;/code&gt;&lt;/li&gt;
&lt;/ol&gt;
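&lt;p&gt;For the first problem, the mitigation comes down to a single external-dns Helm value (key names assumed from the upstream chart); the analysis follows below:&lt;/p&gt;
&lt;pre tabindex="0"&gt;&lt;code class="language-yaml" data-lang="yaml"&gt;# external-dns values (sketch)
policy: upsert-only        # create/update records, never delete them
txtOwnerId: staging-eks    # assumption: scope record ownership to this cluster
&lt;/code&gt;&lt;/pre&gt;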
&lt;hr&gt;
&lt;h2 id="root-cause-1-external-dns-policy-sync"&gt;Root Cause 1: external-dns &lt;code&gt;policy: sync&lt;/code&gt;&lt;/h2&gt;
&lt;h3 id="the-issue"&gt;The issue&lt;/h3&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-yaml" data-lang="yaml"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#f92672"&gt;policy&lt;/span&gt;: &lt;span style="color:#ae81ff"&gt;sync &lt;/span&gt; &lt;span style="color:#75715e"&gt;# allows record deletion&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;code&gt;policy: sync&lt;/code&gt; &lt;strong&gt;deletes Route53 records&lt;/strong&gt; when the corresponding Kubernetes Ingress/Service is removed.
Bringing down or scaling in EKS → Ingress deleted → external-dns deletes Route53 records.&lt;/p&gt;</description></item><item><title>Istio Sidecar Startup Timing (holdApplicationUntilProxyStarts)</title><link>https://212clab.pages.dev/en/troubleshooting/2026-04-08-istio-sidecar-startup-timing/</link><pubDate>Wed, 08 Apr 2026 00:00:00 +0000</pubDate><guid>https://212clab.pages.dev/en/troubleshooting/2026-04-08-istio-sidecar-startup-timing/</guid><description>&lt;h2 id="problem"&gt;Problem&lt;/h2&gt;
&lt;p&gt;Pods in dev-webs were stuck at &lt;code&gt;1/2 Running&lt;/code&gt;. Readiness probes were failing with &lt;code&gt;context deadline exceeded&lt;/code&gt;.&lt;/p&gt;
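&lt;p&gt;The fix named in the title makes the app container wait for the proxy. A minimal per-workload sketch (the flag can also be set mesh-wide via &lt;code&gt;meshConfig.defaultConfig&lt;/code&gt;):&lt;/p&gt;
&lt;pre tabindex="0"&gt;&lt;code class="language-yaml" data-lang="yaml"&gt;# pod template fragment (sketch)
spec:
  template:
    metadata:
      annotations:
        proxy.istio.io/config: |
          holdApplicationUntilProxyStarts: true
&lt;/code&gt;&lt;/pre&gt;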
&lt;h3 id="symptoms"&gt;Symptoms&lt;/h3&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-bash" data-lang="bash"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;kubectl get pods -n dev-webs
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;# NAME READY STATUS RESTARTS&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;# dev-api-gateway-xxx 1/2 Running 0&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;kubectl describe pod dev-api-gateway-xxx -n dev-webs
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;# Readiness probe failed: Get &amp;#34;http://10.x.x.x:8085/actuator/health/readiness&amp;#34;: context deadline exceeded&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h2 id="root-cause"&gt;Root Cause&lt;/h2&gt;
&lt;p&gt;The app container started before the Istio sidecar (the &lt;code&gt;istio-proxy&lt;/code&gt; Envoy container) was fully initialized.
When the app tried to connect to external services (DB, Redis, etc.), the sidecar wasn&amp;rsquo;t ready to handle traffic yet, causing the connections to fail.&lt;/p&gt;</description></item><item><title>k6-operator MaxVUs Parallelism Error</title><link>https://212clab.pages.dev/en/troubleshooting/2026-04-05-k6-operator-maxvus/</link><pubDate>Sun, 05 Apr 2026 00:00:00 +0000</pubDate><guid>https://212clab.pages.dev/en/troubleshooting/2026-04-05-k6-operator-maxvus/</guid><description>&lt;h2 id="problem"&gt;Problem&lt;/h2&gt;
&lt;p&gt;A 4000 VU load test failed immediately upon execution:&lt;/p&gt;
&lt;pre tabindex="0"&gt;&lt;code&gt;Status: error (Pod: Succeeded)
Summary not available (test may have ended too quickly)
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;k6-operator logs:&lt;/p&gt;
&lt;pre tabindex="0"&gt;&lt;code&gt;k6 inspect: {MaxVUs:1 ...}
ERROR: Parallelism argument cannot be larger than maximum VUs in the script
{&amp;#34;maxVUs&amp;#34;: 1, &amp;#34;parallelism&amp;#34;: 2, &amp;#34;error&amp;#34;: &amp;#34;number of instances &amp;gt; number of VUs&amp;#34;}
&lt;/code&gt;&lt;/pre&gt;&lt;h2 id="root-cause-analysis"&gt;Root Cause Analysis&lt;/h2&gt;
&lt;h3 id="k6-operator-execution-flow"&gt;k6-operator execution flow&lt;/h3&gt;
&lt;pre tabindex="0"&gt;&lt;code&gt;1. TestRun CR created
       ↓
2. Initializer Pod runs
   - k6 inspect: analyzes script (determines maxVUs) ← no env vars here!
   - k6 archive: compresses script
       ↓
3. Runner Pod created (VUS, DURATION env vars injected)
       ↓
4. Test runs
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id="problematic-code"&gt;Problematic code&lt;/h3&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-javascript" data-lang="javascript"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;// k6 script generated by scripts.go
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#66d9ef"&gt;const&lt;/span&gt; &lt;span style="color:#a6e22e"&gt;_vus&lt;/span&gt; &lt;span style="color:#f92672"&gt;=&lt;/span&gt; &lt;span style="color:#a6e22e"&gt;__ENV&lt;/span&gt;.&lt;span style="color:#a6e22e"&gt;K6_VUS&lt;/span&gt; &lt;span style="color:#f92672"&gt;?&lt;/span&gt; parseInt(&lt;span style="color:#a6e22e"&gt;__ENV&lt;/span&gt;.&lt;span style="color:#a6e22e"&gt;K6_VUS&lt;/span&gt;) &lt;span style="color:#f92672"&gt;:&lt;/span&gt; (parseInt(&lt;span style="color:#a6e22e"&gt;__ENV&lt;/span&gt;.&lt;span style="color:#a6e22e"&gt;VUS&lt;/span&gt;) &lt;span style="color:#f92672"&gt;||&lt;/span&gt; &lt;span style="color:#ae81ff"&gt;1&lt;/span&gt;);
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;// ↑
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;// default value 1!
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;table&gt;
 &lt;thead&gt;
 &lt;tr&gt;
 &lt;th&gt;Stage&lt;/th&gt;
 &lt;th&gt;Env vars&lt;/th&gt;
 &lt;th&gt;_vus value&lt;/th&gt;
 &lt;/tr&gt;
 &lt;/thead&gt;
 &lt;tbody&gt;
 &lt;tr&gt;
 &lt;td&gt;k6 inspect (Initializer)&lt;/td&gt;
 &lt;td&gt;none&lt;/td&gt;
 &lt;td&gt;&lt;strong&gt;1&lt;/strong&gt; (default)&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;k6 run (Runner)&lt;/td&gt;
 &lt;td&gt;VUS=4000&lt;/td&gt;
 &lt;td&gt;&lt;strong&gt;4000&lt;/strong&gt;&lt;/td&gt;
 &lt;/tr&gt;
 &lt;/tbody&gt;
&lt;/table&gt;
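&lt;p&gt;For reference, the TestRun CR behind this flow declares env vars only for the runner, which is why &lt;code&gt;k6 inspect&lt;/code&gt; never sees them. A sketch (names illustrative):&lt;/p&gt;
&lt;pre tabindex="0"&gt;&lt;code class="language-yaml" data-lang="yaml"&gt;apiVersion: k6.io/v1alpha1
kind: TestRun
metadata:
  name: load-test
spec:
  parallelism: 2
  script:
    configMap:
      name: load-test-script
      file: test.js
  runner:              # env vars land here, on the Runner Pods only
    env:
      - name: VUS
        value: &amp;#34;4000&amp;#34;
&lt;/code&gt;&lt;/pre&gt;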
&lt;p&gt;&lt;code&gt;k6 inspect&lt;/code&gt; analyzes &lt;code&gt;export const options&lt;/code&gt; in the script to determine maxVUs.
Without env vars, the default value of 1 is used, so maxVUs is reported as 1.&lt;/p&gt;</description></item><item><title>DiskPressure &amp; CrashLoopBackOff Troubleshooting</title><link>https://212clab.pages.dev/en/troubleshooting/2026-04-02-disk-pressure-crashloop/</link><pubDate>Thu, 02 Apr 2026 00:00:00 +0000</pubDate><guid>https://212clab.pages.dev/en/troubleshooting/2026-04-02-disk-pressure-crashloop/</guid><description>&lt;h2 id="problem"&gt;Problem&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;1,253 argocd-repo-server pods created on the mini-gmk node (all in Evicted state)&lt;/li&gt;
&lt;li&gt;DiskPressure: True&lt;/li&gt;
&lt;li&gt;Multiple services in CrashLoopBackOff (ai-defense, seat, etc.)&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="root-cause-analysis"&gt;Root Cause Analysis&lt;/h2&gt;
&lt;h3 id="1-diskpressure-mini-gmk"&gt;1. DiskPressure (mini-gmk)&lt;/h3&gt;
&lt;table&gt;
 &lt;thead&gt;
 &lt;tr&gt;
 &lt;th&gt;Path&lt;/th&gt;
 &lt;th&gt;Usage&lt;/th&gt;
 &lt;th&gt;Cause&lt;/th&gt;
 &lt;/tr&gt;
 &lt;/thead&gt;
 &lt;tbody&gt;
 &lt;tr&gt;
 &lt;td&gt;/opt/local-path-provisioner&lt;/td&gt;
 &lt;td&gt;53G&lt;/td&gt;
 &lt;td&gt;Loki chunks 40G + orphan PVs&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Loki retention&lt;/td&gt;
 &lt;td&gt;not configured&lt;/td&gt;
 &lt;td&gt;Logs accumulated indefinitely&lt;/td&gt;
 &lt;/tr&gt;
 &lt;/tbody&gt;
&lt;/table&gt;
&lt;h3 id="2-crashloopbackoff"&gt;2. CrashLoopBackOff&lt;/h3&gt;
&lt;table&gt;
 &lt;thead&gt;
 &lt;tr&gt;
 &lt;th&gt;Service&lt;/th&gt;
 &lt;th&gt;Cause&lt;/th&gt;
 &lt;/tr&gt;
 &lt;/thead&gt;
 &lt;tbody&gt;
 &lt;tr&gt;
 &lt;td&gt;staging ai-defense&lt;/td&gt;
 &lt;td&gt;&lt;code&gt;imagePullSecrets: []&lt;/code&gt; override missing, so the default ecr-pull-secret was used&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;dev ai-defense&lt;/td&gt;
 &lt;td&gt;Missing &lt;code&gt;TM_OFFLINE_LLM_AUDIT_PATH&lt;/code&gt; env var (PermissionError: logs/)&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;dev seat&lt;/td&gt;
 &lt;td&gt;Missing &lt;code&gt;SPRING_KAFKA_BOOTSTRAP_SERVERS&lt;/code&gt; env var&lt;/td&gt;
 &lt;/tr&gt;
 &lt;/tbody&gt;
&lt;/table&gt;
&lt;h2 id="fix"&gt;Fix&lt;/h2&gt;
&lt;h3 id="resolving-diskpressure"&gt;Resolving DiskPressure&lt;/h3&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-bash" data-lang="bash"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;# 1. Delete orphan PVs (~8G freed)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;sudo rm -rf /opt/local-path-provisioner/pvc-*_data_cloudbeaver-data
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;sudo rm -rf /opt/local-path-provisioner/pvc-*_monitoring_data-prometheus-*
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;# (PVs not currently in use)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;# 2. Delete Loki chunks (~40G freed)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;sudo find /opt/local-path-provisioner/pvc-*_monitoring_storage-loki-0/chunks -type f -mtime +1 -delete
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;sudo rm -rf /opt/local-path-provisioner/pvc-*_monitoring_storage-loki-0/chunks/*
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;# 3. Clean up system logs&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;sudo journalctl --vacuum-size&lt;span style="color:#f92672"&gt;=&lt;/span&gt;200M
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;# 4. Delete Evicted pod records&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;kubectl delete pods -A --field-selector&lt;span style="color:#f92672"&gt;=&lt;/span&gt;status.phase&lt;span style="color:#f92672"&gt;=&lt;/span&gt;Failed
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;kubectl delete pods -A --field-selector&lt;span style="color:#f92672"&gt;=&lt;/span&gt;status.phase&lt;span style="color:#f92672"&gt;=&lt;/span&gt;Succeeded
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h3 id="helm-values-updates"&gt;Helm Values Updates&lt;/h3&gt;
&lt;h4 id="1-add-loki-retention-devstaging"&gt;1. Add Loki retention (dev/staging)&lt;/h4&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-yaml" data-lang="yaml"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;# dev/values/monitoring/values-loki.yaml&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#f92672"&gt;loki&lt;/span&gt;:
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#f92672"&gt;compactor&lt;/span&gt;:
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#f92672"&gt;retention_enabled&lt;/span&gt;: &lt;span style="color:#66d9ef"&gt;true&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#f92672"&gt;delete_request_store&lt;/span&gt;: &lt;span style="color:#ae81ff"&gt;filesystem&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#f92672"&gt;limits_config&lt;/span&gt;:
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#f92672"&gt;retention_period&lt;/span&gt;: &lt;span style="color:#ae81ff"&gt;72h &lt;/span&gt; &lt;span style="color:#75715e"&gt;# 3 days&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h4 id="2-add-revisionhistorylimit-common-charts"&gt;2. Add revisionHistoryLimit (common charts)&lt;/h4&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-yaml" data-lang="yaml"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;# common-charts/apps/java-service/templates/deployment.yaml&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;# common-charts/apps/ai-service/templates/deployment.yaml&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#f92672"&gt;spec&lt;/span&gt;:
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#f92672"&gt;revisionHistoryLimit&lt;/span&gt;: &lt;span style="color:#ae81ff"&gt;3&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h4 id="3-staging-ai-defense-imagepullsecrets"&gt;3. staging ai-defense imagePullSecrets&lt;/h4&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-yaml" data-lang="yaml"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;# staging/values/apps/values-ai-defense.yaml&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#f92672"&gt;imagePullSecrets&lt;/span&gt;: [] &lt;span style="color:#75715e"&gt;# Disabled since IRSA is used&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h4 id="4-dev-ai-defense-environment-variable"&gt;4. dev ai-defense environment variable&lt;/h4&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-yaml" data-lang="yaml"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;# dev/values/apps/values-ai-defense.yaml&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#f92672"&gt;env&lt;/span&gt;:
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; - &lt;span style="color:#f92672"&gt;name&lt;/span&gt;: &lt;span style="color:#ae81ff"&gt;TM_OFFLINE_LLM_AUDIT_PATH&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#f92672"&gt;value&lt;/span&gt;: &lt;span style="color:#e6db74"&gt;&amp;#34;/tmp/logs/offline_llm_audit.jsonl&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h4 id="5-dev-seat-kafka-config"&gt;5. dev seat Kafka config&lt;/h4&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-yaml" data-lang="yaml"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;# dev/values/apps/values-seat.yaml&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#f92672"&gt;env&lt;/span&gt;:
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; - &lt;span style="color:#f92672"&gt;name&lt;/span&gt;: &lt;span style="color:#ae81ff"&gt;SPRING_KAFKA_BOOTSTRAP_SERVERS&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#f92672"&gt;value&lt;/span&gt;: &lt;span style="color:#e6db74"&gt;&amp;#34;kafka.messaging.svc.cluster.local:9092&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h2 id="commit"&gt;Commit&lt;/h2&gt;
&lt;pre tabindex="0"&gt;&lt;code&gt;feat(sprint5/troubleshooting): resolve disk pressure and CrashLoopBackOff issues

- common-charts: add revisionHistoryLimit: 3
- dev/staging loki: add compactor for retention
- staging ai-defense: add imagePullSecrets: []
- dev ai-defense: add TM_OFFLINE_LLM_AUDIT_PATH
- dev seat: add Kafka bootstrap servers
&lt;/code&gt;&lt;/pre&gt;&lt;h2 id="lessons-learned"&gt;Lessons Learned&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Loki retention is mandatory&lt;/strong&gt; — retention_period does nothing without the compactor&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;local-path-provisioner orphan PVs&lt;/strong&gt; — deleting a PVC does not automatically remove the directory on disk&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Staging uses IRSA&lt;/strong&gt; — &lt;code&gt;imagePullSecrets: []&lt;/code&gt; must be explicitly set to override the default&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;revisionHistoryLimit&lt;/strong&gt; — the default of 10 is too high; cap it at 3&lt;/li&gt;
&lt;/ol&gt;</description></item><item><title>WireGuard + Cilium IP Range Conflict Troubleshooting</title><link>https://212clab.pages.dev/en/troubleshooting/2026-03-29-wireguard-cilium-ip-conflict/</link><pubDate>Sun, 29 Mar 2026 00:00:00 +0000</pubDate><guid>https://212clab.pages.dev/en/troubleshooting/2026-03-29-wireguard-cilium-ip-conflict/</guid><description>&lt;h3 id="date"&gt;Date&lt;/h3&gt;
&lt;p&gt;2026-03-29&lt;/p&gt;
&lt;h2 id="problem"&gt;Problem&lt;/h2&gt;
&lt;p&gt;External WireGuard VPN connections stopped working after migrating to Cilium (eBPF).&lt;/p&gt;
&lt;h3 id="symptoms"&gt;Symptoms&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;ping 10.0.0.1&lt;/code&gt; failing from an external network (hotspot) — 100% packet loss&lt;/li&gt;
&lt;li&gt;&lt;code&gt;ssh grgb-vpn&lt;/code&gt; unreachable&lt;/li&gt;
&lt;li&gt;Everything worked fine from the home network&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="root-cause-analysis"&gt;Root Cause Analysis&lt;/h2&gt;
&lt;h3 id="diagnosis-step-1-tcpdump"&gt;Diagnosis step 1: tcpdump&lt;/h3&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-bash" data-lang="bash"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;sudo tcpdump -i enp3s0 udp port &lt;span style="color:#ae81ff"&gt;51820&lt;/span&gt; -n
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;# Result: 0 packets — packets never reached the server&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Initially suspected a router/ISP issue, but&amp;hellip;&lt;/p&gt;
&lt;h3 id="diagnosis-step-2-check-cilium-configuration"&gt;Diagnosis step 2: Check Cilium configuration&lt;/h3&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-bash" data-lang="bash"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;kubectl -n kube-system get cm cilium-config -o yaml | grep cluster-pool
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;# cluster-pool-ipv4-cidr: 10.0.0.0/8&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h3 id="diagnosis-step-3-check-routing-table"&gt;Diagnosis step 3: Check routing table&lt;/h3&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-bash" data-lang="bash"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;ip route | grep 10.0.0
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;# 10.0.0.0/24 via 10.0.0.224 dev cilium_host&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h3 id="root-cause"&gt;Root cause&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Cilium claimed the entire &lt;code&gt;10.0.0.0/8&lt;/code&gt; range as its cluster network&lt;/li&gt;
&lt;li&gt;WireGuard was also using &lt;code&gt;10.0.0.0/24&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;IP range conflict&lt;/strong&gt;: Cilium was routing WireGuard traffic to &lt;code&gt;cilium_host&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
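&lt;p&gt;The fix below re-addresses the VPN; on the server that looks roughly like this (keys and peers omitted):&lt;/p&gt;
&lt;pre tabindex="0"&gt;&lt;code&gt;# /etc/wireguard/wg0.conf (sketch)
[Interface]
Address = 172.30.0.1/24   # was 10.0.0.1/24
ListenPort = 51820
# AllowedIPs for each peer moves to 172.30.0.x/32 accordingly
&lt;/code&gt;&lt;/pre&gt;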
&lt;h2 id="fix"&gt;Fix&lt;/h2&gt;
&lt;h3 id="change-wireguard-network-range"&gt;Change WireGuard network range&lt;/h3&gt;
&lt;p&gt;&lt;code&gt;10.0.0.0/24&lt;/code&gt; → &lt;code&gt;172.30.0.0/24&lt;/code&gt; (a private IP range that doesn&amp;rsquo;t overlap with Cilium)&lt;/p&gt;</description></item><item><title>Java OOM Troubleshooting</title><link>https://212clab.pages.dev/en/troubleshooting/2026-03-23-java-oom/</link><pubDate>Mon, 23 Mar 2026 00:00:00 +0000</pubDate><guid>https://212clab.pages.dev/en/troubleshooting/2026-03-23-java-oom/</guid><description>&lt;h2 id="problem"&gt;Problem&lt;/h2&gt;
&lt;p&gt;On an on-prem K8s cluster (mini-might worker node), a Java app ran out of memory, causing the OOM Killer to fire and bring down the node. This post covers the analysis and resolution.&lt;/p&gt;
&lt;hr&gt;
&lt;h3 id="symptoms"&gt;Symptoms&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;All pods on the mini-might node showed &lt;code&gt;Unknown&lt;/code&gt; status in k9s&lt;/li&gt;
&lt;li&gt;Pods recovered sequentially after a reboot&lt;/li&gt;
&lt;li&gt;&lt;code&gt;kubectl describe node mini-might&lt;/code&gt; Events showed a &lt;code&gt;Rebooted&lt;/code&gt; entry&lt;/li&gt;
&lt;/ul&gt;
&lt;pre tabindex="0"&gt;&lt;code&gt;NodeNotReady   5m ago    - node went down
Rebooted       58s ago   - reboot detected
NodeReady      17s ago   - recovery complete
&lt;/code&gt;&lt;/pre&gt;&lt;hr&gt;
&lt;h2 id="root-cause-analysis"&gt;Root Cause Analysis&lt;/h2&gt;
&lt;h3 id="1-checking-oom-logs-in-dmesg"&gt;1. Checking OOM logs in dmesg&lt;/h3&gt;
&lt;p&gt;After SSHing into the mini-might node:&lt;/p&gt;</description></item><item><title>ALB Elastic IP Auto-Assignment Issue</title><link>https://212clab.pages.dev/en/troubleshooting/2026-03-22-alb-eip-auto-assign/</link><pubDate>Sun, 22 Mar 2026 00:00:00 +0000</pubDate><guid>https://212clab.pages.dev/en/troubleshooting/2026-03-22-alb-eip-auto-assign/</guid><description>&lt;h2 id="problem"&gt;Problem&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Date&lt;/strong&gt;: 2026-03-22&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Environment&lt;/strong&gt;: EKS Staging (goormgb-staging-eks)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Symptom&lt;/strong&gt;: Elastic IPs automatically assigned to an internet-facing ALB, generating unexpected costs&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h3 id="symptoms"&gt;Symptoms&lt;/h3&gt;
&lt;h3 id="how-it-was-discovered"&gt;How It Was Discovered&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;Changed ArgoCD from &lt;code&gt;type: LoadBalancer&lt;/code&gt; (NLB) to &lt;code&gt;type: ClusterIP&lt;/code&gt; (ALB Ingress)&lt;/li&gt;
&lt;li&gt;After deleting the NLB, found 2 EIPs remaining&lt;/li&gt;
&lt;li&gt;Attempting to disassociate/release the EIPs resulted in a permission error&lt;/li&gt;
&lt;/ol&gt;
&lt;h3 id="error-message"&gt;Error Message&lt;/h3&gt;
&lt;pre tabindex="0"&gt;&lt;code&gt;An error occurred (AuthFailure) when calling the DisassociateAddress operation:
You do not have permission to access the specified resource.
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id="confirmed-state"&gt;Confirmed State&lt;/h3&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-bash" data-lang="bash"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;aws ec2 describe-network-interfaces &lt;span style="color:#ae81ff"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; --filters &lt;span style="color:#e6db74"&gt;&amp;#34;Name=addresses.private-ip-address,Values=10.0.18.47&amp;#34;&lt;/span&gt; &lt;span style="color:#ae81ff"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; --query &lt;span style="color:#e6db74"&gt;&amp;#39;NetworkInterfaces[*].[Description,NetworkInterfaceId]&amp;#39;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;# Result: ELB app/k8s-stagingalb-4f414fcf8f/...&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;hr&gt;
&lt;h2 id="root-cause"&gt;Root Cause&lt;/h2&gt;
&lt;h3 id="aws-official-answer"&gt;AWS Official Answer&lt;/h3&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Automatic EIP assignment to internet-facing ALBs is expected behavior.&lt;/strong&gt;&lt;/p&gt;</description></item><item><title>EKS Node Cluster Join Failure Checklist</title><link>https://212clab.pages.dev/en/troubleshooting/2026-03-21-eks-node-join-failure/</link><pubDate>Sat, 21 Mar 2026 00:00:00 +0000</pubDate><guid>https://212clab.pages.dev/en/troubleshooting/2026-03-21-eks-node-join-failure/</guid><description>&lt;h2 id="problem"&gt;Problem&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;EKS node group created but nodes are not joining the cluster&lt;/li&gt;
&lt;li&gt;&lt;code&gt;kubectl get nodes&lt;/code&gt; shows no nodes&lt;/li&gt;
&lt;li&gt;EC2 instances are in Running state&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id="checklist"&gt;Checklist&lt;/h2&gt;
&lt;h3 id="1-vpc-dns-settings"&gt;1. VPC DNS Settings&lt;/h3&gt;
&lt;table&gt;
 &lt;thead&gt;
 &lt;tr&gt;
 &lt;th&gt;Setting&lt;/th&gt;
 &lt;th&gt;Required Value&lt;/th&gt;
 &lt;th&gt;Check&lt;/th&gt;
 &lt;/tr&gt;
 &lt;/thead&gt;
 &lt;tbody&gt;
 &lt;tr&gt;
 &lt;td&gt;&lt;code&gt;enable_dns_hostnames&lt;/code&gt;&lt;/td&gt;
 &lt;td&gt;&lt;code&gt;true&lt;/code&gt;&lt;/td&gt;
 &lt;td&gt;&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;&lt;code&gt;enable_dns_support&lt;/code&gt;&lt;/td&gt;
 &lt;td&gt;&lt;code&gt;true&lt;/code&gt;&lt;/td&gt;
 &lt;td&gt;&lt;/td&gt;
 &lt;/tr&gt;
 &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;&lt;strong&gt;Verification:&lt;/strong&gt;&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-hcl" data-lang="hcl"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#66d9ef"&gt;resource&lt;/span&gt; &lt;span style="color:#e6db74"&gt;&amp;#34;aws_vpc&amp;#34; &amp;#34;main&amp;#34;&lt;/span&gt; {
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; enable_dns_hostnames &lt;span style="color:#f92672"&gt;=&lt;/span&gt; &lt;span style="color:#66d9ef"&gt;true&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; enable_dns_support &lt;span style="color:#f92672"&gt;=&lt;/span&gt; &lt;span style="color:#66d9ef"&gt;true&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;}
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;hr&gt;
&lt;h3 id="2-subnet-tags"&gt;2. Subnet Tags&lt;/h3&gt;
&lt;p&gt;The EKS controller requires specific tags to recognize subnets.&lt;/p&gt;
&lt;table&gt;
 &lt;thead&gt;
 &lt;tr&gt;
 &lt;th&gt;Subnet&lt;/th&gt;
 &lt;th&gt;Tag&lt;/th&gt;
 &lt;th&gt;Check&lt;/th&gt;
 &lt;/tr&gt;
 &lt;/thead&gt;
 &lt;tbody&gt;
 &lt;tr&gt;
 &lt;td&gt;Public&lt;/td&gt;
 &lt;td&gt;&lt;code&gt;kubernetes.io/role/elb = 1&lt;/code&gt;&lt;/td&gt;
 &lt;td&gt;&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Private&lt;/td&gt;
 &lt;td&gt;&lt;code&gt;kubernetes.io/role/internal-elb = 1&lt;/code&gt;&lt;/td&gt;
 &lt;td&gt;&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Both&lt;/td&gt;
 &lt;td&gt;&lt;code&gt;kubernetes.io/cluster/&amp;lt;cluster-name&amp;gt; = shared&lt;/code&gt;&lt;/td&gt;
 &lt;td&gt;&lt;/td&gt;
 &lt;/tr&gt;
 &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;&lt;strong&gt;Verification:&lt;/strong&gt;&lt;/p&gt;</description></item><item><title>EKS Troubleshooting - Node Group Creation Failure and Missing vpc-cni Addon</title><link>https://212clab.pages.dev/en/troubleshooting/2026-03-21-eks-nodegroup-creation-failure/</link><pubDate>Sat, 21 Mar 2026 00:00:00 +0000</pubDate><guid>https://212clab.pages.dev/en/troubleshooting/2026-03-21-eks-nodegroup-creation-failure/</guid><description>&lt;h2 id="problem"&gt;Problem&lt;/h2&gt;
&lt;p&gt;This post covers two major issues encountered while building an EKS 1.34 cluster, along with their resolutions.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id="1-node-group-creation-failure---kubelet-label-issue"&gt;1. Node Group Creation Failure - kubelet Label Issue&lt;/h2&gt;
&lt;h3 id="symptoms"&gt;Symptoms&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Node Group status: &lt;code&gt;CREATE_FAILED&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Error message: &lt;code&gt;NodeCreationFailure - Unhealthy nodes in the kubernetes cluster&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;EC2 instances are running but not joining the cluster&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="root-cause-analysis"&gt;Root Cause Analysis&lt;/h3&gt;
&lt;p&gt;Connected to the node via SSM and checked kubelet logs:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-bash" data-lang="bash"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;sudo journalctl -u kubelet -n &lt;span style="color:#ae81ff"&gt;50&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;strong&gt;kubelet error:&lt;/strong&gt;&lt;/p&gt;
&lt;pre tabindex="0"&gt;&lt;code&gt;unknown &amp;#39;kubernetes.io&amp;#39; or &amp;#39;k8s.io&amp;#39; labels specified with --node-labels
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id="root-cause"&gt;Root Cause&lt;/h3&gt;
&lt;p&gt;The Terraform EKS module was using &lt;code&gt;node-role.kubernetes.io/infra&lt;/code&gt; as a node group label:&lt;/p&gt;</description></item><item><title>K8s Taint vs NodeSelector Troubleshooting</title><link>https://212clab.pages.dev/en/troubleshooting/2026-03-20-taint-nodeselector/</link><pubDate>Fri, 20 Mar 2026 00:00:00 +0000</pubDate><guid>https://212clab.pages.dev/en/troubleshooting/2026-03-20-taint-nodeselector/</guid><description>&lt;h2 id="problem"&gt;Problem&lt;/h2&gt;
&lt;p&gt;ArgoCD alerts triggered for Pods in dev-webs showing &lt;code&gt;Degraded&lt;/code&gt; status:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;dev-order-core&lt;/li&gt;
&lt;li&gt;dev-seat&lt;/li&gt;
&lt;li&gt;dev-queue&lt;/li&gt;
&lt;/ul&gt;
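&lt;p&gt;The root cause below is a scheduling imbalance; a matching fix is to let infra Pods tolerate the control-plane taint. A minimal sketch, assuming the standard kubeadm taint and node label:&lt;/p&gt;
&lt;pre tabindex="0"&gt;&lt;code class="language-yaml" data-lang="yaml"&gt;# pod spec fragment (sketch)
tolerations:
  - key: node-role.kubernetes.io/control-plane
    operator: Exists
    effect: NoSchedule
nodeSelector:
  node-role.kubernetes.io/control-plane: &amp;#34;&amp;#34;   # assumption: pin to mini-gmk
&lt;/code&gt;&lt;/pre&gt;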
&lt;h2 id="root-cause"&gt;Root Cause&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;The app-dedicated node (mini-might) ran out of resources&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Cause: The control-plane node (mini-gmk) has a &lt;code&gt;node-role.kubernetes.io/control-plane:NoSchedule&lt;/code&gt; taint, which prevented infra Pods (istio, loki, calico, coredns, metrics-server, etc.) from being scheduled there. They all got pushed onto the &lt;strong&gt;app-only worker node&lt;/strong&gt; instead.&lt;/p&gt;
&lt;pre tabindex="0"&gt;&lt;code&gt;Infra Pods scheduled on mini-might (worker):
- istio-ingressgateway (x2)
- istiod
- kiali
- loki-0, loki-chunks-cache-0, loki-results-cache-0
- calico-apiserver (x2), calico-kube-controllers, calico-typha
- coredns (x2)
- metrics-server
- alloy
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;→ Resource contention when deploying app Pods, new Pods go Pending&lt;/p&gt;</description></item><item><title>Kubernetes Service Selector &amp; Endpoints Troubleshooting</title><link>https://212clab.pages.dev/en/troubleshooting/2026-03-19-service-selector-endpoints/</link><pubDate>Thu, 19 Mar 2026 00:00:00 +0000</pubDate><guid>https://212clab.pages.dev/en/troubleshooting/2026-03-19-service-selector-endpoints/</guid><description>&lt;h2 id="problem"&gt;Problem&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;API requests returning &lt;strong&gt;503 Service Unavailable&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Istio logs showing: &lt;code&gt;no_healthy_upstream&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Pods are in Running state but service routing is broken&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="root-cause"&gt;Root Cause&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Mismatch between Service selector labels and Pod labels&lt;/strong&gt;&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-bash" data-lang="bash"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;# Check endpoints - &amp;lt;none&amp;gt; means something is wrong!&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;kubectl -n staging-webs get endpoints api-gateway
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;NAME          ENDPOINTS   AGE
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;api-gateway   &amp;lt;none&amp;gt;      34h   &lt;span style="color:#75715e"&gt;# ← no Pod IPs!&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h2 id="how-it-works"&gt;How It Works&lt;/h2&gt;
&lt;pre tabindex="0"&gt;&lt;code&gt;┌─────────────────────────────────────────────────────────┐
│ Service (api-gateway)                                   │
│   selector:                                             │
│     app: staging-api-gateway                            │
│     app.kubernetes.io/part-of: staging-webs  ← required!│
└─────────────────────────────────────────────────────────┘
          │
          ▼ find Pods with matching selector labels
┌─────────────────────────────────────────────────────────┐
│ Endpoints (auto-generated)                              │
│   addresses:                                            │
│     - 10.0.20.100 (Pod1 IP)                             │
│     - 10.0.20.101 (Pod2 IP)                             │
└─────────────────────────────────────────────────────────┘
          │
          ▼ load balance traffic
┌─────────────────────────────────────────────────────────┐
│ Pod (must have labels that match the selector)          │
│   labels:                                               │
│     app: staging-api-gateway  ✓                         │
│     app.kubernetes.io/part-of: staging-webs  ✓          │
└─────────────────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;&lt;strong&gt;Key point: All labels in the Service selector must be present on the Pod for it to be registered in Endpoints&lt;/strong&gt;&lt;/p&gt;</description></item><item><title>Java Spring Boot Probe Troubleshooting</title><link>https://212clab.pages.dev/en/troubleshooting/2026-03-18-java-probe-crashloop/</link><pubDate>Wed, 18 Mar 2026 00:00:00 +0000</pubDate><guid>https://212clab.pages.dev/en/troubleshooting/2026-03-18-java-probe-crashloop/</guid><description>&lt;h2 id="problem"&gt;Problem&lt;/h2&gt;
&lt;p&gt;Java Spring Boot apps in the Staging EKS environment kept restarting in CrashLoopBackOff.&lt;/p&gt;
&lt;h3 id="symptoms"&gt;Symptoms&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Pods receive SIGTERM (exit code 143) within 60 seconds of starting&lt;/li&gt;
&lt;li&gt;kubelet forcefully kills the container due to liveness probe failures&lt;/li&gt;
&lt;li&gt;404 errors (initially) → 503/500 errors (after partial fix)&lt;/li&gt;
&lt;/ul&gt;
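&lt;p&gt;As the analysis below shows, the probes were missing each app&amp;rsquo;s context path. A corrected sketch for auth-guard (port illustrative):&lt;/p&gt;
&lt;pre tabindex="0"&gt;&lt;code class="language-yaml" data-lang="yaml"&gt;# container spec fragment (sketch)
livenessProbe:
  httpGet:
    path: /auth/actuator/health/liveness    # was /actuator/health/liveness
    port: 8080
readinessProbe:
  httpGet:
    path: /auth/actuator/health/readiness
    port: 8080
&lt;/code&gt;&lt;/pre&gt;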
&lt;h2 id="root-cause-analysis"&gt;Root Cause Analysis&lt;/h2&gt;
&lt;h3 id="1-missing-context-path-primary-cause"&gt;1. Missing Context Path (Primary Cause)&lt;/h3&gt;
&lt;p&gt;Spring Boot apps use a context path:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;auth-guard: &lt;code&gt;/auth&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;queue: &lt;code&gt;/queue&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;seat: &lt;code&gt;/seat&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;order-core: &lt;code&gt;/order&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;But the probe configuration was:&lt;/p&gt;</description></item><item><title>OpenTelemetry + Istio mTLS Troubleshooting</title><link>https://212clab.pages.dev/en/troubleshooting/2026-03-18-otel-istio-mtls/</link><pubDate>Wed, 18 Mar 2026 00:00:00 +0000</pubDate><guid>https://212clab.pages.dev/en/troubleshooting/2026-03-18-otel-istio-mtls/</guid><description>&lt;h2 id="problem"&gt;Problem&lt;/h2&gt;
&lt;p&gt;Java services in the EKS Staging environment were failing to start, stuck in CrashLoopBackOff. Log analysis revealed that the OTEL (OpenTelemetry) agent was failing to connect to the OTEL Collector.&lt;/p&gt;
&lt;h3 id="environment"&gt;Environment&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Kubernetes&lt;/strong&gt;: EKS 1.34&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Istio&lt;/strong&gt;: mTLS STRICT mode enabled&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;OTEL Agent&lt;/strong&gt;: opentelemetry-javaagent v2.11.0&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;OTEL Collector&lt;/strong&gt;: opentelemetry-collector (Deployment)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Namespace&lt;/strong&gt;: staging-webs (apps), monitoring (OTEL Collector)&lt;/li&gt;
&lt;/ul&gt;
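&lt;p&gt;One common mitigation for this class of failure, not necessarily the one adopted here, is to bypass the sidecar for OTLP egress (assuming OTLP gRPC on 4317):&lt;/p&gt;
&lt;pre tabindex="0"&gt;&lt;code class="language-yaml" data-lang="yaml"&gt;# pod template annotation (sketch; one of several possible fixes)
metadata:
  annotations:
    traffic.sidecar.istio.io/excludeOutboundPorts: &amp;#34;4317&amp;#34;
&lt;/code&gt;&lt;/pre&gt;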
&lt;hr&gt;
&lt;h3 id="symptoms"&gt;Symptoms&lt;/h3&gt;
&lt;h3 id="pod-status"&gt;Pod Status&lt;/h3&gt;
&lt;pre tabindex="0"&gt;&lt;code&gt;staging-webs   auth-guard-xxx   0/1   CrashLoopBackOff   17
staging-webs   order-core-xxx   0/1   CrashLoopBackOff   21
staging-webs   queue-xxx        0/1   CrashLoopBackOff   22
staging-webs   seat-xxx         0/1   CrashLoopBackOff   21
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id="error-logs"&gt;Error Logs&lt;/h3&gt;
&lt;pre tabindex="0"&gt;&lt;code&gt;[otel.javaagent] ERROR io.opentelemetry.exporter.internal.grpc.GrpcExporter -
Failed to export metrics. Server is UNAVAILABLE.
Make sure your collector is running and reachable from this network.
Full error message: upstream connect error or disconnect/reset before headers.
retried and the latest reset reason: remote connection failure,
transport failure reason: TLS_error:|268435703:SSL routines:OPENSSL_internal:WRONG_VERSION_NUMBER:TLS_error_end
&lt;/code&gt;&lt;/pre&gt;&lt;hr&gt;
&lt;h2 id="root-cause-analysis"&gt;Root Cause Analysis&lt;/h2&gt;
&lt;h3 id="1-normal-istio-mtls-traffic-flow"&gt;1. Normal Istio mTLS Traffic Flow&lt;/h3&gt;
&lt;p&gt;For typical service-to-service communication:&lt;/p&gt;</description></item><item><title>Chrome QUIC/HTTP3 Intermittent 404 Troubleshooting</title><link>https://212clab.pages.dev/en/troubleshooting/2026-03-17-chrome-quic-http3-detail/</link><pubDate>Tue, 17 Mar 2026 00:00:00 +0000</pubDate><guid>https://212clab.pages.dev/en/troubleshooting/2026-03-17-chrome-quic-http3-detail/</guid><description>&lt;blockquote&gt;
&lt;p&gt;Resolving intermittent 404 errors in Chrome on a home server (kubeadm) + SK Broadband environment&lt;/p&gt;
&lt;/blockquote&gt;
&lt;hr&gt;
&lt;h2 id="home-server-infrastructure"&gt;Home Server Infrastructure&lt;/h2&gt;
&lt;h3 id="11-overall-network-architecture"&gt;1.1 Overall Network Architecture&lt;/h3&gt;
&lt;pre tabindex="0"&gt;&lt;code class="language-mermaid" data-lang="mermaid"&gt;flowchart TB
 subgraph Internet[&amp;#34;Internet&amp;#34;]
 CF[&amp;#34;Cloudflare Edge&amp;lt;br/&amp;gt;(Seoul PoP)&amp;#34;]
 ISP[&amp;#34;SK Broadband&amp;lt;br/&amp;gt;Public IP: Dynamic&amp;lt;br/&amp;gt;(39.119.192.15)&amp;#34;]
 end

 subgraph HomeNetwork[&amp;#34;Home Network (192.168.45.0/24)&amp;#34;]
 subgraph Router[&amp;#34;SK Broadband Router&amp;#34;]
 NAT[&amp;#34;NAT/DHCP&amp;#34;]
 FW_R[&amp;#34;Firewall&amp;#34;]
 PF[&amp;#34;Port Forwarding&amp;lt;br/&amp;gt;:80 → .154:80&amp;lt;br/&amp;gt;:443 → .154:443&amp;#34;]
 DHCP_RES[&amp;#34;DHCP Reservation&amp;lt;br/&amp;gt;mini-gmk: .123&amp;lt;br/&amp;gt;mini-might: .154&amp;#34;]
 end

 subgraph ControlPlane[&amp;#34;mini-gmk (Control Plane)&amp;lt;br/&amp;gt;192.168.45.123&amp;#34;]
 UFW_C[&amp;#34;ufw firewall&amp;lt;br/&amp;gt;22, 6443, 10250-10252&amp;lt;br/&amp;gt;2379-2380, 179&amp;#34;]
 K8S_CP[&amp;#34;Kubernetes Control Plane&amp;lt;br/&amp;gt;kube-apiserver&amp;lt;br/&amp;gt;etcd, scheduler&amp;lt;br/&amp;gt;controller-manager&amp;#34;]
 ISTIOD[&amp;#34;istiod (pilot)&amp;#34;]
 CERTM[&amp;#34;cert-manager&amp;#34;]
 DDNS[&amp;#34;ddns-cloudflare&amp;lt;br/&amp;gt;CronJob&amp;#34;]
 ESO[&amp;#34;external-secrets-operator&amp;#34;]
 CALICO_CP[&amp;#34;calico-typha&amp;lt;br/&amp;gt;calico-kube-controllers&amp;#34;]
 end

 subgraph Worker[&amp;#34;mini-might (Worker)&amp;lt;br/&amp;gt;192.168.45.154&amp;#34;]
 UFW_W[&amp;#34;ufw firewall&amp;lt;br/&amp;gt;22, 80, 443&amp;lt;br/&amp;gt;10250, 30000-32767&amp;#34;]
 INGRESS[&amp;#34;istio-ingressgateway&amp;lt;br/&amp;gt;externalIPs: .154&amp;#34;]
 GW[&amp;#34;java-cloud-gateway&amp;lt;br/&amp;gt;:8085&amp;#34;]
 AUTH[&amp;#34;auth-guard (x2)&amp;lt;br/&amp;gt;:8080&amp;#34;]
 APPS[&amp;#34;order-core :8083&amp;lt;br/&amp;gt;seat :8082&amp;lt;br/&amp;gt;queue :8081&amp;#34;]
 CALICO_W[&amp;#34;calico-node&amp;#34;]
 end

 subgraph VPN_Access[&amp;#34;Team Member Access&amp;#34;]
 VPN[&amp;#34;Tailscale/WireGuard&amp;lt;br/&amp;gt;VPN&amp;#34;]
 SSH[&amp;#34;SSH Tunnel&amp;lt;br/&amp;gt;:22&amp;#34;]
 end
 end

 subgraph Clients[&amp;#34;Clients&amp;#34;]
 Chrome[&amp;#34;Chrome&amp;lt;br/&amp;gt;(QUIC/HTTP3)&amp;#34;]
 Safari[&amp;#34;Safari&amp;lt;br/&amp;gt;(HTTP/2)&amp;#34;]
 Firefox[&amp;#34;Firefox&amp;lt;br/&amp;gt;(HTTP/2)&amp;#34;]
 Mobile[&amp;#34;Mobile LTE&amp;#34;]
 TeamMember[&amp;#34;Team Member (VPN)&amp;#34;]
 end

 Chrome --&amp;gt;|&amp;#34;QUIC/UDP:443&amp;lt;br/&amp;gt;❌ intermittent failure&amp;#34;| ISP
 Safari --&amp;gt;|&amp;#34;HTTP/2/TCP:443&amp;#34;| ISP
 Firefox --&amp;gt;|&amp;#34;HTTP/2/TCP:443&amp;#34;| ISP
 Mobile --&amp;gt;|&amp;#34;HTTP/2&amp;#34;| ISP

 Chrome --&amp;gt;|&amp;#34;QUIC/UDP:443&amp;lt;br/&amp;gt;✅ stable&amp;#34;| CF
 CF --&amp;gt;|&amp;#34;HTTP/2/TCP:443&amp;#34;| ISP

 ISP --&amp;gt; NAT
 NAT --&amp;gt; FW_R
 FW_R --&amp;gt; PF
 PF --&amp;gt;|&amp;#34;:443&amp;#34;| UFW_W
 UFW_W --&amp;gt; INGRESS

 INGRESS --&amp;gt; GW
 GW --&amp;gt; AUTH
 AUTH --&amp;gt; APPS

 TeamMember --&amp;gt;|&amp;#34;VPN&amp;#34;| VPN
 VPN --&amp;gt; SSH
 SSH --&amp;gt; UFW_C
 SSH --&amp;gt; UFW_W

 DDNS --&amp;gt;|&amp;#34;API call&amp;#34;| CF

 ISTIOD -.-&amp;gt;|&amp;#34;xDS&amp;#34;| INGRESS
 CERTM -.-&amp;gt;|&amp;#34;TLS cert&amp;#34;| INGRESS
 ESO -.-&amp;gt;|&amp;#34;Secret sync&amp;#34;| DDNS

 classDef problem fill:#ff6b6b,stroke:#c0392b,color:#fff
 classDef solution fill:#2ecc71,stroke:#27ae60,color:#fff
 classDef infrastructure fill:#3498db,stroke:#2980b9,color:#fff

 class Chrome problem
 class CF solution
 class Router,ControlPlane,Worker infrastructure
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id="12-sk-broadband-router-settings"&gt;1.2 SK Broadband Router Settings&lt;/h3&gt;
&lt;pre tabindex="0"&gt;&lt;code class="language-mermaid" data-lang="mermaid"&gt;flowchart LR
 subgraph SKRouter[&amp;#34;SK Broadband Router Settings&amp;#34;]
 direction TB

 subgraph DHCP[&amp;#34;DHCP Settings&amp;#34;]
 DHCP_RANGE[&amp;#34;DHCP range: 192.168.45.100 ~ .200&amp;#34;]
 DHCP_RES1[&amp;#34;Reservation 1: mini-gmk&amp;lt;br/&amp;gt;MAC: XX:XX:XX:XX:XX:XX&amp;lt;br/&amp;gt;IP: 192.168.45.123&amp;#34;]
 DHCP_RES2[&amp;#34;Reservation 2: mini-might&amp;lt;br/&amp;gt;MAC: YY:YY:YY:YY:YY:YY&amp;lt;br/&amp;gt;IP: 192.168.45.154&amp;#34;]
 end

 subgraph PortForward[&amp;#34;Port Forwarding&amp;#34;]
 PF_HTTP[&amp;#34;External :80 → 192.168.45.154:80&amp;#34;]
 PF_HTTPS[&amp;#34;External :443 → 192.168.45.154:443&amp;#34;]
 PF_VPN[&amp;#34;External :51820 → 192.168.45.123:51820&amp;lt;br/&amp;gt;(WireGuard, optional)&amp;#34;]
 end

 subgraph Firewall[&amp;#34;Firewall Settings&amp;#34;]
 FW_IN[&amp;#34;Inbound: allow 80, 443&amp;#34;]
 FW_OUT[&amp;#34;Outbound: allow all&amp;#34;]
 FW_ICMP[&amp;#34;ICMP: allow (ping)&amp;#34;]
 end

 subgraph NAT_Config[&amp;#34;NAT Settings&amp;#34;]
 NAT_TYPE[&amp;#34;NAT type: Symmetric&amp;#34;]
 NAT_UDP[&amp;#34;UDP timeout: 30s (root cause)&amp;#34;]
 NAT_TCP[&amp;#34;TCP timeout: 3600s&amp;#34;]
 end
 end

 style NAT_UDP fill:#ff6b6b,stroke:#c0392b,color:#fff
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;&lt;strong&gt;Router settings detail:&lt;/strong&gt;&lt;/p&gt;</description></item><item><title>Resolving Swagger 403, ArgoCD Dashboard, and CORS Issues</title><link>https://212clab.pages.dev/en/troubleshooting/2026-03-11-swagger-argocd-cors/</link><pubDate>Wed, 11 Mar 2026 00:00:00 +0000</pubDate><guid>https://212clab.pages.dev/en/troubleshooting/2026-03-11-swagger-argocd-cors/</guid><description>&lt;blockquote&gt;
&lt;p&gt;2026-03-11 | Resolving Swagger 403, ArgoCD Dashboard, and CORS Issues&lt;/p&gt;
&lt;/blockquote&gt;
&lt;hr&gt;
&lt;h2 id="1-swagger-403--oauth-login-redirect"&gt;1. Swagger 403 → OAuth Login Redirect&lt;/h2&gt;
&lt;h3 id="problem"&gt;Problem&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Accessing &lt;code&gt;swagger.dev.goormgb.space&lt;/code&gt; shows a &lt;strong&gt;403 Forbidden&lt;/strong&gt; page&lt;/li&gt;
&lt;li&gt;No redirect to the login page&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="root-cause"&gt;Root Cause&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Browser has a &lt;strong&gt;stale cookie&lt;/strong&gt; (expired &lt;code&gt;_oauth2_proxy&lt;/code&gt; cookie)&lt;/li&gt;
&lt;li&gt;OAuth2 Proxy validates the cookie, finds it invalid → returns &lt;strong&gt;403&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Normal flow: no cookie → 302 redirect → login&lt;/li&gt;
&lt;/ul&gt;
&lt;pre tabindex="0"&gt;&lt;code&gt;[Normal flow - no cookie]
User → OAuth2 Proxy → no cookie → 302 → Google login

[Problem - expired cookie]
User → OAuth2 Proxy → cookie present (expired) → 403 Forbidden
                                                 ↑ stops here!
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id="fix-envoyfilter-to-convert-403302-redirect"&gt;Fix: EnvoyFilter to convert 403→302 redirect&lt;/h3&gt;
&lt;pre tabindex="0"&gt;&lt;code&gt;apiVersion: networking.istio.io/v1alpha3
kind: EnvoyFilter
metadata:
  name: swagger-403-redirect
  namespace: istio-system
spec:
  workloadSelector:
    labels:
      istio: ingressgateway
  configPatches:
  - applyTo: HTTP_FILTER
    match:
      context: GATEWAY
      listener:
        filterChain:
          filter:
            name: envoy.filters.network.http_connection_manager
            subFilter:
              name: envoy.filters.http.router
    patch:
      operation: INSERT_BEFORE
      value:
        name: envoy.filters.http.lua
        typed_config:
          &amp;#34;@type&amp;#34;: type.googleapis.com/envoy.extensions.filters.http.lua.v3.Lua
          inlineCode: |
            -- &amp;#34;:authority&amp;#34; is a request pseudo-header and is absent from the
            -- response header map, so capture host and path at request time and
            -- hand them to the response phase via dynamic metadata.
            function envoy_on_request(request_handle)
              local meta = request_handle:streamInfo():dynamicMetadata()
              meta:set(&amp;#34;swagger_redirect&amp;#34;, &amp;#34;host&amp;#34;,
                       request_handle:headers():get(&amp;#34;:authority&amp;#34;) or &amp;#34;&amp;#34;)
              meta:set(&amp;#34;swagger_redirect&amp;#34;, &amp;#34;path&amp;#34;,
                       request_handle:headers():get(&amp;#34;:path&amp;#34;) or &amp;#34;/&amp;#34;)
            end

            function envoy_on_response(response_handle)
              local meta = response_handle:streamInfo():dynamicMetadata():get(&amp;#34;swagger_redirect&amp;#34;) or {}
              local host = meta[&amp;#34;host&amp;#34;] or &amp;#34;&amp;#34;
              local status = response_handle:headers():get(&amp;#34;:status&amp;#34;)

              -- only handle 403 responses on the swagger domain
              if string.find(host, &amp;#34;swagger&amp;#34;) and status == &amp;#34;403&amp;#34; then
                local redirect_url = &amp;#34;/oauth2/start?rd=&amp;#34; .. (meta[&amp;#34;path&amp;#34;] or &amp;#34;/&amp;#34;)

                -- convert to 302 redirect
                response_handle:headers():replace(&amp;#34;:status&amp;#34;, &amp;#34;302&amp;#34;)
                response_handle:headers():add(&amp;#34;location&amp;#34;, redirect_url)
                response_handle:headers():add(&amp;#34;cache-control&amp;#34;, &amp;#34;no-cache, no-store&amp;#34;)
              end
            end
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id="key-points"&gt;Key Points&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;envoy_on_request&lt;/strong&gt;: records host and original path in dynamic metadata (&lt;code&gt;:authority&lt;/code&gt; is not available on the response header map)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;envoy_on_response&lt;/strong&gt;: intercepts at the response phase&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;swagger domain only&lt;/strong&gt;: no impact on other services&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;403 → 302&lt;/strong&gt; conversion + &lt;code&gt;/oauth2/start&lt;/code&gt; redirect&lt;/li&gt;
&lt;li&gt;OAuth2 Proxy then runs the login flow from the beginning&lt;/li&gt;
&lt;/ul&gt;
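&lt;p&gt;One way to verify the conversion (a sketch, not part of the original fix; the cookie value is a stand-in for any expired &lt;code&gt;_oauth2_proxy&lt;/code&gt; cookie) is to replay the failure case and assert that a 302 with a &lt;code&gt;Location&lt;/code&gt; header comes back instead of a 403:&lt;/p&gt;
&lt;pre tabindex="0"&gt;&lt;code&gt;import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class RedirectCheck {
    public static void main(String[] args) throws Exception {
        // do not follow redirects: we want to observe the raw 302 itself
        HttpClient client = HttpClient.newBuilder()
                .followRedirects(HttpClient.Redirect.NEVER)
                .build();

        // simulate the failure case: a present-but-invalid _oauth2_proxy cookie
        HttpRequest request = HttpRequest.newBuilder(
                        URI.create(&amp;#34;https://swagger.dev.goormgb.space/&amp;#34;))
                .header(&amp;#34;Cookie&amp;#34;, &amp;#34;_oauth2_proxy=stale&amp;#34;)
                .GET()
                .build();

        HttpResponse&lt;Void&gt; response =
                client.send(request, HttpResponse.BodyHandlers.discarding());

        // expect 302 plus Location: /oauth2/start?rd=... (a 403 means the filter missed)
        System.out.println(response.statusCode());
        System.out.println(response.headers().firstValue(&amp;#34;location&amp;#34;).orElse(&amp;#34;(none)&amp;#34;));
    }
}
&lt;/code&gt;&lt;/pre&gt;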
&lt;hr&gt;
&lt;h2 id="2-cors-errors-causing-frontend-blank-page"&gt;2. CORS Errors Causing Frontend Blank Page&lt;/h2&gt;
&lt;h3 id="problem-1"&gt;Problem&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;CORS errors when the frontend makes API calls&lt;/li&gt;
&lt;li&gt;The only evidence is errors in the browser console; the page renders blank&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="root-cause-1"&gt;Root Cause&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;The backend &lt;strong&gt;does not include CORS headers in 400/500 error responses&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;pre tabindex="0"&gt;&lt;code&gt;[Successful 200 response]
Access-Control-Allow-Origin: https://dev.goormgb.space ✓
→ Frontend can read the response body

[Error 500 response]
Access-Control-Allow-Origin: (absent) ✗
→ Browser blocks the response → frontend cannot determine the error cause
→ Error handling fails → blank page
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id="why-cors-headers-are-needed-even-on-error-responses"&gt;Why CORS headers are needed even on error responses&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Browser security policy&lt;/strong&gt;: when origins differ, the browser blocks any response that lacks CORS headers, regardless of status code&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Error handling&lt;/strong&gt;: with the response blocked, &lt;code&gt;response.json()&lt;/code&gt; itself fails, so the frontend never learns why the call failed&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;UX improvement&lt;/strong&gt;: once the headers are present, error messages can be surfaced to the user instead of a blank page&lt;/li&gt;
&lt;/ol&gt;
&lt;h3 id="fix-forwarded-to-backend-team"&gt;Fix (forwarded to backend team)&lt;/h3&gt;
&lt;pre tabindex="0"&gt;&lt;code&gt;// Spring Boot - include CORS headers in error responses too
@Configuration
public class CorsConfig implements WebMvcConfigurer {
    @Override
    public void addCorsMappings(CorsRegistry registry) {
        registry.addMapping(&amp;#34;/**&amp;#34;)
                .allowedOrigins(&amp;#34;https://dev.goormgb.space&amp;#34;)
                .allowedMethods(&amp;#34;*&amp;#34;)
                .allowCredentials(true);
    }
}

// Or add headers to all exception responses via @ControllerAdvice
&lt;/code&gt;&lt;/pre&gt;
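&lt;p&gt;For the &lt;code&gt;@ControllerAdvice&lt;/code&gt; route, a minimal sketch of what that could look like; the class name, the catch-all handler, and the error payload shape are illustrative assumptions, not the backend team&amp;rsquo;s actual code:&lt;/p&gt;
&lt;pre tabindex="0"&gt;&lt;code&gt;import java.util.Map;
import org.springframework.http.HttpStatus;
import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.ExceptionHandler;
import org.springframework.web.bind.annotation.RestControllerAdvice;

// Illustrative sketch: set the CORS headers explicitly on every exception
// response so the browser lets the frontend read the error body.
@RestControllerAdvice
public class GlobalExceptionHandler {

    @ExceptionHandler(Exception.class)
    public ResponseEntity&lt;Map&lt;String, String&gt;&gt; handleAll(Exception e) {
        String message = e.getMessage() != null ? e.getMessage() : &amp;#34;internal error&amp;#34;;
        return ResponseEntity
                .status(HttpStatus.INTERNAL_SERVER_ERROR)
                .header(&amp;#34;Access-Control-Allow-Origin&amp;#34;, &amp;#34;https://dev.goormgb.space&amp;#34;)
                .header(&amp;#34;Access-Control-Allow-Credentials&amp;#34;, &amp;#34;true&amp;#34;)
                .body(Map.of(&amp;#34;error&amp;#34;, message));
    }
}
&lt;/code&gt;&lt;/pre&gt;&lt;hr&gt;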
&lt;h2 id="3-authtokenrefresh-404-error"&gt;3. /auth/token/refresh 404 Error&lt;/h2&gt;
&lt;h3 id="problem-2"&gt;Problem&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Frontend calls &lt;code&gt;/auth/token/refresh&lt;/code&gt; and receives 404&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="root-cause-path-mismatch"&gt;Root Cause: Path Mismatch&lt;/h3&gt;
&lt;table&gt;
 &lt;thead&gt;
 &lt;tr&gt;
 &lt;th&gt;Side&lt;/th&gt;
 &lt;th&gt;Path&lt;/th&gt;
 &lt;/tr&gt;
 &lt;/thead&gt;
 &lt;tbody&gt;
 &lt;tr&gt;
 &lt;td&gt;Frontend call&lt;/td&gt;
 &lt;td&gt;&lt;code&gt;/auth/token/refresh&lt;/code&gt;&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;API Gateway routing&lt;/td&gt;
 &lt;td&gt;&lt;code&gt;Path=/auth/**&lt;/code&gt; → Auth-Guard&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Auth-Guard actual path&lt;/td&gt;
 &lt;td&gt;&lt;code&gt;/token/refresh&lt;/code&gt; (no class-level @RequestMapping)&lt;/td&gt;
 &lt;/tr&gt;
 &lt;/tbody&gt;
&lt;/table&gt;
&lt;pre tabindex="0"&gt;&lt;code&gt;Frontend: /auth/token/refresh
 ↓
API Gateway: matches /auth/** → forwards to Auth-Guard
 ↓
Auth-Guard: receives /auth/token/refresh
 ↓
404! (only /token/refresh actually exists)
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id="fix-forwarded-to-backend-team-1"&gt;Fix (forwarded to backend team)&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Option 1: Add StripPrefix filter&lt;/strong&gt; (recommended)&lt;/p&gt;
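&lt;p&gt;A sketch of what that could look like with the Spring Cloud Gateway Java DSL; the route id and the &lt;code&gt;lb://auth-guard&lt;/code&gt; URI are illustrative assumptions, and the real routes may just as well live in &lt;code&gt;application.yml&lt;/code&gt;:&lt;/p&gt;
&lt;pre tabindex="0"&gt;&lt;code&gt;import org.springframework.cloud.gateway.route.RouteLocator;
import org.springframework.cloud.gateway.route.builder.RouteLocatorBuilder;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class GatewayRoutes {

    @Bean
    public RouteLocator authGuardRoute(RouteLocatorBuilder builder) {
        return builder.routes()
                // strip the first segment: /auth/token/refresh -&gt; /token/refresh
                .route(&amp;#34;auth-guard&amp;#34;, r -&gt; r
                        .path(&amp;#34;/auth/**&amp;#34;)
                        .filters(f -&gt; f.stripPrefix(1))
                        .uri(&amp;#34;lb://auth-guard&amp;#34;))
                .build();
    }
}
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;With the prefix stripped, Auth-Guard receives &lt;code&gt;/token/refresh&lt;/code&gt;, which matches the controller&amp;rsquo;s actual mapping.&lt;/p&gt;</description></item></channel></rss>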