Istio ext_authz Adapter Design and Implementation

Background

The AI defense team needed two things: real-time bot detection and post-event analysis. Embedding this in application code would have meant adding an SDK to every service, which is a maintenance burden. Istio’s ext_authz extension point instead lets the mesh intercept requests transparently, without touching app code, and forward them to the AI engine. I proposed this approach to the AI team and we built it together.

Division of Responsibility

Owner	Work
Infrastructure (me)	Istio ext_authz integration design, Go gRPC adapter development, EnvoyFilter authoring, Helm deployment configuration
AI team	Critical API selection, verdict spec design, AI Defense API (Python FastAPI) development, behavioral analysis engine

Core idea: Envoy is already proxying every request — just hook ext_authz into it and get an AI verdict there.

Architecture

Full Request Flow

[Browser]
    ↓ HTTPS
[Istio IngressGateway (Envoy)]
    ↓ ext_authz gRPC (Critical APIs only)
[Authz Adapter (Go :9001)]
    ↓ POST /ai/evaluate
[AI Defense API (Python :8000)]
    ↓ Redis (session/blacklist)
    ↓ Brain Decision Engine
    ↓
Response: NONE | CHALLENGE | THROTTLE | GATE | BLOCK
    ↓
[Authz Adapter]
    ├─ NONE      → 200 (pass)
    ├─ CHALLENGE → 428 (challenge required)
    ├─ THROTTLE  → 200 + delay
    ├─ GATE      → 429 (temporary block)
    └─ BLOCK     → 403 + Redis blacklist registration
    ↓
[Envoy → Backend API]

Defense Layer Structure

Layer 1: CloudFront + AWS WAF     (Edge - L7 DDoS, bulk blocking)
Layer 2: Istio EnvoyFilter WAF    (Mesh - SQLi/XSS/CmdInj detection)
Layer 3: Istio Rate Limiting      (Mesh - per-path request limiting)
Layer 4: ext_authz + AI Defense   (Mesh - behavior-based bot detection) ← this post

Authz Adapter (Go)

Role

A Go gRPC server that implements Istio’s ext_authz protocol. Acts as a bridge between Envoy and the AI Defense API.

Core Features

1. Envoy ext_authz gRPC Server

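import (
    "context"
    authv3 "github.com/envoyproxy/go-control-plane/envoy/service/auth/v3"
    "google.golang.org/genproto/googleapis/rpc/status"
    "google.golang.org/grpc/codes"
)

// Envoy calls this via gRPC for each request
func (h *Handler) Check(ctx context.Context, req *authv3.CheckRequest) (*authv3.CheckResponse, error) {
    // 1. Extract IP, path, X-Bot-Token from request headers
    // 2. Check whitelist (team IPs)
    // 3. Check blacklist (Redis DB 5)
    // 4. Call AI Defense API
    // 5. Respond with allow/block/challenge based on result
    return &authv3.CheckResponse{Status: &status.Status{Code: int32(codes.OK)}}, nil // outline only: allow
}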
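
The order of those steps matters more than the Envoy types, so here is a minimal sketch of the decision flow using only the standard library. It is not the adapter's actual code: `Gate`, `Decide`, and `Evaluator` are hypothetical names, plain maps stand in for Redis, and X-Bot-Token validation and the THROTTLE/GATE verdicts are left out to keep it short.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
)

// Evaluator is a hypothetical stand-in for the call to POST /ai/evaluate.
// It returns the verdict action as a string, e.g. "NONE" or "BLOCK".
type Evaluator func(ip, path string) (string, error)

// Gate holds the adapter's lookups. Plain maps stand in for Redis here.
type Gate struct {
	Whitelist map[string]bool
	Blacklist map[string]bool
	Evaluate  Evaluator
}

// Decide returns the HTTP status the adapter would report back to Envoy.
func (g *Gate) Decide(ip, path string) int {
	if g.Whitelist[ip] { // step 2: whitelisted IPs always pass
		return http.StatusOK
	}
	if g.Blacklist[ip] { // step 3: blocked without calling the AI
		return http.StatusForbidden
	}
	action, err := g.Evaluate(ip, path) // step 4
	if err != nil {
		return http.StatusOK // AI unavailable: let the request through
	}
	switch action { // step 5
	case "BLOCK":
		g.Blacklist[ip] = true // remember the IP for later requests
		return http.StatusForbidden
	case "CHALLENGE":
		return http.StatusPreconditionRequired // 428
	default:
		return http.StatusOK
	}
}

func main() {
	calls := 0
	g := &Gate{
		Whitelist: map[string]bool{"10.0.0.1": true},
		Blacklist: map[string]bool{},
		Evaluate: func(ip, path string) (string, error) {
			calls++
			if ip == "203.0.113.9" {
				return "BLOCK", nil
			}
			return "", errors.New("ai defense unavailable")
		},
	}
	fmt.Println(g.Decide("10.0.0.1", "/api/seat/select"))     // whitelisted IP
	fmt.Println(g.Decide("203.0.113.9", "/api/seat/select"))  // BLOCK verdict from the AI
	fmt.Println(g.Decide("203.0.113.9", "/api/seat/select"))  // answered from the blacklist
	fmt.Println(g.Decide("198.51.100.7", "/api/seat/select")) // AI error, allowed through
	fmt.Println("AI calls:", calls)
}
```

Checking the whitelist and blacklist before the AI call is what keeps repeat offenders and team traffic from costing an evaluation round trip.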

2. Critical API Filtering

Sending every request to the AI would be costly and slow, so only 7 core ticketing APIs are evaluated:

/api/queue/entry          ← queue entry
/api/seat/select          ← seat selection
/api/payment/process      ← payment processing
/api/booking/confirm      ← booking confirmation
/api/user/signup          ← user registration
/api/auth/login           ← login
/api/order/create         ← order creation

All other paths bypass ext_authz and pass through directly.
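
To make the matching rule concrete, here is a small Go illustration. In this design the filtering happens in the mesh (the Lua filter in the EnvoyFilter section), so `isCriticalPath` is a hypothetical helper, and matching the path exactly after dropping the query string is my assumption about the intended rule.

```go
package main

import (
	"fmt"
	"strings"
)

// criticalPaths holds the seven Critical API paths listed above.
var criticalPaths = map[string]struct{}{
	"/api/queue/entry":     {},
	"/api/seat/select":     {},
	"/api/payment/process": {},
	"/api/booking/confirm": {},
	"/api/user/signup":     {},
	"/api/auth/login":      {},
	"/api/order/create":    {},
}

// isCriticalPath reports whether a raw :path value, which may carry a query
// string, names one of the Critical APIs. Exact matching is an assumption.
func isCriticalPath(rawPath string) bool {
	path, _, _ := strings.Cut(rawPath, "?") // drop the query string, if any
	_, ok := criticalPaths[path]
	return ok
}

func main() {
	for _, p := range []string{
		"/api/seat/select?seat=A1",
		"/api/health",
		"/api/seat/selection",
	} {
		fmt.Printf("%-26s critical=%v\n", p, isCriticalPath(p))
	}
}
```

Exact matching avoids accidentally evaluating look-alike paths such as `/api/seat/selection`; a prefix rule would be the alternative if the APIs had sub-routes.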

3. IP Blacklist / Whitelist (Redis)

Redis DB 3: CDN Edge blacklist (managed by Lambda, TTL 30 days)
Redis DB 5: Real-time blacklist (managed by Adapter, TTL 7 days)

On a BLOCK verdict, the IP is registered in Redis DB 5, so subsequent requests from it are blocked immediately without calling the AI.
22 team member IPs are whitelisted and always pass through.
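
The expiry behavior of the real-time blacklist can be sketched in memory. This is only an illustration under assumptions: `Blacklist` and `fakeClock` are hypothetical types, and a map with expiry timestamps stands in for Redis, where the same effect would come from writing the key with a 7-day expiry.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// blacklistTTL matches the 7-day TTL of the real-time blacklist.
const blacklistTTL = 7 * 24 * time.Hour

// Blacklist is an in-memory stand-in for the Redis-backed blacklist.
type Blacklist struct {
	mu      sync.Mutex
	expires map[string]time.Time
	now     func() time.Time // injectable clock
}

func NewBlacklist(now func() time.Time) *Blacklist {
	return &Blacklist{expires: make(map[string]time.Time), now: now}
}

// Add registers an IP; the entry lapses after blacklistTTL.
func (b *Blacklist) Add(ip string) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.expires[ip] = b.now().Add(blacklistTTL)
}

// Contains reports whether the IP is currently blacklisted and drops the
// entry if it has expired.
func (b *Blacklist) Contains(ip string) bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	exp, ok := b.expires[ip]
	if !ok {
		return false
	}
	if !b.now().Before(exp) {
		delete(b.expires, ip)
		return false
	}
	return true
}

// fakeClock is a manually advanced clock for the example below.
type fakeClock struct{ t time.Time }

func (c *fakeClock) Now() time.Time          { return c.t }
func (c *fakeClock) Advance(d time.Duration) { c.t = c.t.Add(d) }

func main() {
	clock := &fakeClock{}
	bl := NewBlacklist(clock.Now)
	bl.Add("203.0.113.9")
	fmt.Println("right after BLOCK:", bl.Contains("203.0.113.9"))
	clock.Advance(blacklistTTL)
	fmt.Println("after the TTL:    ", bl.Contains("203.0.113.9"))
}
```

Letting entries expire means a blocked IP gets re-evaluated by the AI after a week instead of staying banned forever.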

4. X-Bot-Token Validation

The frontend (Next.js, Vercel) generates an X-Bot-Token via getBotToken() and includes it in request headers. The Adapter validates this token.

[Frontend (Next.js)]
    → generate X-Bot-Token via getBotToken()
    → request header: X-Bot-Token: <token>
        ↓
[Authz Adapter]
    → validate token
    → missing or invalid → 403

5. Fail-Open Policy

If the AI Defense API does not respond within the 800ms timeout, the request is allowed through. This deliberately favors availability over security: a failure of the defense system during a ticket sale should never block legitimate users.
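
A fail-open call can be sketched with a context deadline around the HTTP request. This is a minimal sketch, not the adapter's code: `evaluate` and `demo` are hypothetical names, the `action` JSON field is an assumption about the response schema, and treating non-200 or malformed responses the same as a timeout is my assumption too. The demo uses a 100ms timeout against local test servers so it runs quickly.

```go
package main

import (
	"bytes"
	"context"
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

// evaluate asks the AI Defense API for a verdict and returns its action.
// On timeout, transport error, non-200 status, or an undecodable body it
// returns "NONE", which is the fail-open path.
func evaluate(ctx context.Context, url string, timeout time.Duration, payload []byte) string {
	ctx, cancel := context.WithTimeout(ctx, timeout)
	defer cancel()

	req, err := http.NewRequestWithContext(ctx, http.MethodPost, url, bytes.NewReader(payload))
	if err != nil {
		return "NONE"
	}
	req.Header.Set("Content-Type", "application/json")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return "NONE" // includes an exceeded deadline
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return "NONE"
	}
	var out struct {
		Action string `json:"action"` // assumed field name
	}
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil || out.Action == "" {
		return "NONE"
	}
	return out.Action
}

// demo runs evaluate against a fast and a slow local test server.
func demo() (fast, slow string) {
	handler := func(delay time.Duration) http.HandlerFunc {
		return func(w http.ResponseWriter, r *http.Request) {
			select {
			case <-time.After(delay):
			case <-r.Context().Done(): // client went away
				return
			}
			fmt.Fprint(w, `{"action":"BLOCK"}`)
		}
	}
	fastSrv := httptest.NewServer(handler(0))
	defer fastSrv.Close()
	slowSrv := httptest.NewServer(handler(500 * time.Millisecond))
	defer slowSrv.Close()

	const timeout = 100 * time.Millisecond // the adapter is configured with 800ms
	fast = evaluate(context.Background(), fastSrv.URL, timeout, []byte(`{}`))
	slow = evaluate(context.Background(), slowSrv.URL, timeout, []byte(`{}`))
	return fast, slow
}

func main() {
	fast, slow := demo()
	fmt.Println("fast server verdict:", fast)
	fmt.Println("slow server verdict:", slow)
}
```

The slow server would have answered BLOCK, but because it misses the deadline the caller sees NONE. That is the trade-off of failing open: an overloaded AI engine lets bots through rather than locking users out.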

AI Defense API (Python)

A FastAPI-based bot detection engine built by the AI team. Below is the integration structure from an infrastructure perspective.

API Endpoints

Endpoint	Role
`POST /ai/evaluate`	Evaluate request → behavioral verdict (called by Adapter)
`POST /ai/challenge/start`	Issue VQA challenge token
`POST /ai/challenge/verify`	Verify challenge response
`POST /ai/telemetry/ingest`	Collect browser telemetry
`POST /ai/precheck`	Pre-validation before queue entry

Verdict Actions

Action	HTTP Response	Description
NONE	200	Normal pass-through
CHALLENGE	428	VQA challenge (Catch Ball mini-game) required
THROTTLE	200 + delay	T1: 200ms / T2: 1800ms delay
GATE	429	Temporary block
BLOCK	403	Block + IP blacklist registration (Redis DB 5, TTL 7 days)
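
On the adapter side, the table above boils down to a single mapping function. The following is a sketch under assumptions: `Outcome` and `outcomeFor` are hypothetical names, how the API signals the throttle tier is not specified in this post (an integer tier is assumed here), and letting unrecognized actions through is my reading of the fail-open policy.

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// Throttle delays from the verdict table.
const (
	throttleT1 = 200 * time.Millisecond
	throttleT2 = 1800 * time.Millisecond
)

// Outcome is what the adapter does with a verdict.
type Outcome struct {
	Status    int           // HTTP status reported for the request
	Delay     time.Duration // pause before answering (THROTTLE only)
	Blacklist bool          // whether to register the client IP
}

// outcomeFor maps a verdict action to an Outcome. The tier argument only
// matters for THROTTLE: 2 or higher selects T2, anything else selects T1.
func outcomeFor(action string, tier int) Outcome {
	switch action {
	case "CHALLENGE":
		return Outcome{Status: http.StatusPreconditionRequired} // 428
	case "THROTTLE":
		if tier >= 2 {
			return Outcome{Status: http.StatusOK, Delay: throttleT2}
		}
		return Outcome{Status: http.StatusOK, Delay: throttleT1}
	case "GATE":
		return Outcome{Status: http.StatusTooManyRequests} // 429
	case "BLOCK":
		return Outcome{Status: http.StatusForbidden, Blacklist: true} // 403
	default: // "NONE" and anything unrecognized
		return Outcome{Status: http.StatusOK}
	}
}

func main() {
	for _, a := range []string{"NONE", "CHALLENGE", "THROTTLE", "GATE", "BLOCK"} {
		fmt.Printf("%-9s -> %+v\n", a, outcomeFor(a, 2))
	}
}
```

Keeping the mapping in one pure function makes it easy to unit test each row of the table without a running Envoy or AI Defense API.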

Session State Management

Redis DB 0: Session state
  - Per-user request history
  - Accumulated behavior patterns
  - Bot confidence score

Redis DB 5: Blacklist
  - IPs with BLOCK verdict
  - TTL 7 days

Istio Integration (EnvoyFilter)

ext_authz Cluster Definition

apiVersion: networking.istio.io/v1alpha3
kind: EnvoyFilter
metadata:
  name: ext-authz-cluster
  namespace: istio-system
spec:
  configPatches:
    - applyTo: CLUSTER
      patch:
        operation: ADD
        value:
          name: authz-adapter
          type: STRICT_DNS
          connect_timeout: 0.8s
          lb_policy: ROUND_ROBIN
          typed_extension_protocol_options:
            envoy.extensions.upstreams.http.v3.HttpProtocolOptions:
              "@type": type.googleapis.com/envoy.extensions.upstreams.http.v3.HttpProtocolOptions
              explicit_http_config:
                http2_protocol_options: {}
          load_assignment:
            cluster_name: authz-adapter
            endpoints:
              - lb_endpoints:
                  - endpoint:
                      address:
                        socket_address:
                          address: authz-adapter.staging-ai.svc.cluster.local
                          port_value: 9001

Lua Path Matching + ext_authz Trigger

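- applyTo: HTTP_FILTER
  patch:
    operation: INSERT_BEFORE
    value:
      name: envoy.filters.http.lua
      typed_config:
        "@type": type.googleapis.com/envoy.extensions.filters.http.lua.v3.Lua
        inlineCode: |
          -- Critical API paths. matches_critical_path is an illustrative
          -- definition (exact match, query string ignored), not the original.
          local critical = {
            ["/api/queue/entry"] = true, ["/api/seat/select"] = true,
            ["/api/payment/process"] = true, ["/api/booking/confirm"] = true,
            ["/api/user/signup"] = true, ["/api/auth/login"] = true,
            ["/api/order/create"] = true,
          }
          local function matches_critical_path(path)
            if path == nil then return false end
            return critical[path:match("^[^?]*")] == true
          end
          function envoy_on_request(request_handle)
            local path = request_handle:headers():get(":path")
            -- Trigger ext_authz only for Critical API paths
            if matches_critical_path(path) then
              request_handle:headers():add("x-need-authz", "true")
            end
          end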

Deployment Configuration

Helm Values

# values-authz-adapter.yaml
replicaCount: 2
image:
  repository: ghcr.io/goorm-gongbang/authz-adapter
ports:
  grpc: 9001
  metrics: 9090
env:
  AI_DEFENSE_URL: "http://ai-defense.staging-ai.svc.cluster.local:8000/ai/evaluate"
  AI_DEFENSE_TIMEOUT: "800ms"
  REDIS_ADDR: "redis.staging-data.svc.cluster.local:6379"
  OTEL_ENABLED: "true"
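
On the Go side these variables have to be parsed at startup. A minimal sketch of how that could look, assuming a hypothetical `Config` struct and `loadConfig` helper (the adapter's real config code is not shown in this post). The lookup function is injected so the example can run against a sample map; a real binary would pass `os.Getenv`.

```go
package main

import (
	"fmt"
	"strconv"
	"time"
)

// Config mirrors the env block of the Helm values.
type Config struct {
	AIDefenseURL     string
	AIDefenseTimeout time.Duration
	RedisAddr        string
	OTelEnabled      bool
}

// loadConfig reads settings through getenv and validates them.
func loadConfig(getenv func(string) string) (Config, error) {
	cfg := Config{
		AIDefenseURL: getenv("AI_DEFENSE_URL"),
		RedisAddr:    getenv("REDIS_ADDR"),
	}
	if cfg.AIDefenseURL == "" || cfg.RedisAddr == "" {
		return Config{}, fmt.Errorf("AI_DEFENSE_URL and REDIS_ADDR are required")
	}
	// time.ParseDuration understands values such as "800ms".
	timeout, err := time.ParseDuration(getenv("AI_DEFENSE_TIMEOUT"))
	if err != nil {
		return Config{}, fmt.Errorf("AI_DEFENSE_TIMEOUT: %w", err)
	}
	cfg.AIDefenseTimeout = timeout
	if v := getenv("OTEL_ENABLED"); v != "" {
		cfg.OTelEnabled, err = strconv.ParseBool(v)
		if err != nil {
			return Config{}, fmt.Errorf("OTEL_ENABLED: %w", err)
		}
	}
	return cfg, nil
}

func main() {
	// Sample environment copied from the Helm values above.
	env := map[string]string{
		"AI_DEFENSE_URL":     "http://ai-defense.staging-ai.svc.cluster.local:8000/ai/evaluate",
		"AI_DEFENSE_TIMEOUT": "800ms",
		"REDIS_ADDR":         "redis.staging-data.svc.cluster.local:6379",
		"OTEL_ENABLED":       "true",
	}
	cfg, err := loadConfig(func(key string) string { return env[key] })
	if err != nil {
		fmt.Println("config error:", err)
		return
	}
	fmt.Printf("%+v\n", cfg)
}
```

Failing fast on a malformed `AI_DEFENSE_TIMEOUT` is worth it here, since a silently zero timeout would turn every AI call into an immediate fail-open.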

Observability

Item	Configuration
Metrics	Prometheus `:9090` (request count, latency, verdict distribution)
Traces	OpenTelemetry → Collector → Tempo
Logs	Structured JSON → Loki

Summary

Component	Language	Role	Owner
Authz Adapter	Go	Envoy ↔ AI bridge, IP management, token validation	Infrastructure (me)
AI Defense API	Python	Behavioral analysis, bot verdict, challenge issuance	AI team
EnvoyFilter	YAML	ext_authz integration, path filtering	Infrastructure (me)
Helm Charts	YAML	Deployment config, per-environment values	Infrastructure (me)

By leveraging Istio’s ext_authz extension point, AI bot detection is applied transparently at the mesh level without modifying any application code. The Fail-Open policy preserves availability when the AI Defense API is slow or down, and evaluating only the seven Critical APIs confines the added latency to those paths.

Reference Repos

202-goormgb-authz-adapter — Go gRPC ext_authz adapter
201-goormgb-ai — AI Defense API (Python FastAPI)