Istio ext_authz Adapter Design and Implementation
Background#
The AI defense team needed to handle two things: real-time bot detection and post-event analysis. Embedding this directly into application code would mean planting an SDK into every service, making it a maintenance nightmare. By using Istio’s ext_authz extension point, requests can be transparently intercepted at the mesh level without touching app code and forwarded to the AI engine. I proposed this approach to the AI team and we built it together.
Division of Responsibility#
| Owner | Work |
|---|---|
| Infrastructure (me) | Istio ext_authz integration design, Go gRPC adapter development, EnvoyFilter authoring, Helm deployment configuration |
| AI team | Critical API selection, verdict spec design, AI Defense API (Python FastAPI) development, behavioral analysis engine |
Core idea: Envoy is already proxying every request — just hook ext_authz into it and get an AI verdict there.
Architecture#
Full Request Flow#
[Browser]
↓ HTTPS
[Istio IngressGateway (Envoy)]
↓ ext_authz gRPC (Critical APIs only)
[Authz Adapter (Go :9001)]
↓ POST /ai/evaluate
[AI Defense API (Python :8000)]
↓ Redis (session/blacklist)
↓ Brain Decision Engine
↓
Response: NONE | CHALLENGE | THROTTLE | BLOCK
↓
[Authz Adapter]
├─ NONE → 200 (pass)
├─ CHALLENGE → 428 (challenge required)
├─ THROTTLE → 200 + delay
└─ BLOCK → 403 + Redis blacklist registration
↓
[Envoy → Backend API]
Defense Layer Structure#
Layer 1: CloudFront + AWS WAF (Edge - L7 DDoS, bulk blocking)
Layer 2: Istio EnvoyFilter WAF (Mesh - SQLi/XSS/CmdInj detection)
Layer 3: Istio Rate Limiting (Mesh - per-path request limiting)
Layer 4: ext_authz + AI Defense (Mesh - behavior-based bot detection) ← this post
Authz Adapter (Go)#
Role#
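A Go gRPC server that implements Envoy’s ext_authz gRPC API (the envoy.service.auth.v3.Authorization service), which is what Istio’s ext_authz extension point calls. It acts as a bridge between Envoy and the AI Defense API.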
Core Features#
1. Envoy ext_authz gRPC Server#
// Envoy calls this via gRPC for each request
func (h *Handler) Check(ctx context.Context, req *authv3.CheckRequest) (*authv3.CheckResponse, error) {
	// 1. Extract IP, path, X-Bot-Token from request headers
	// 2. Check whitelist (team IPs)
	// 3. Check blacklist (Redis DB 5)
	// 4. Call AI Defense API
	// 5. Respond with allow/block/challenge based on result
}
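The ordering of steps 2 to 5 can be sketched with plain Go types instead of the Envoy protobuf types. This is a minimal sketch, not the adapter's actual code: `Decision`, `Request`, `Deps`, `check`, and `errUnavailable` are hypothetical names, and THROTTLE and GATE handling is left out for brevity.

```go
package main

import (
	"errors"
	"fmt"
)

// Decision is the outcome the adapter reports back to Envoy (hypothetical type).
type Decision string

const (
	Allow     Decision = "allow"
	Deny      Decision = "deny"
	Challenge Decision = "challenge"
)

// Request holds the fields extracted in step 1 (hypothetical type).
type Request struct {
	IP       string
	Path     string
	BotToken string
}

// Deps are hypothetical stand-ins for the whitelist, the Redis blacklist,
// and the AI Defense API client.
type Deps struct {
	Whitelisted func(ip string) bool
	Blacklisted func(ip string) bool
	Evaluate    func(r Request) (verdict string, err error)
}

// errUnavailable simulates an unreachable AI Defense API.
var errUnavailable = errors.New("ai defense unavailable")

// check applies steps 2-5 in order and stops at the first decisive result.
func check(d Deps, r Request) Decision {
	if d.Whitelisted(r.IP) {
		return Allow // step 2: team IPs always pass
	}
	if d.Blacklisted(r.IP) {
		return Deny // step 3: known-bad IPs never reach the AI
	}
	verdict, err := d.Evaluate(r) // step 4
	if err != nil {
		return Allow // fail-open when the AI call fails
	}
	switch verdict { // step 5
	case "BLOCK":
		return Deny
	case "CHALLENGE":
		return Challenge
	default:
		return Allow
	}
}

func main() {
	deps := Deps{
		Whitelisted: func(ip string) bool { return ip == "10.0.0.1" },
		Blacklisted: func(ip string) bool { return ip == "203.0.113.9" },
		Evaluate: func(r Request) (string, error) {
			if r.Path == "/api/seat/select" {
				return "CHALLENGE", nil
			}
			return "", errUnavailable
		},
	}
	fmt.Println(check(deps, Request{IP: "198.51.100.4", Path: "/api/seat/select"}))
	fmt.Println(check(deps, Request{IP: "198.51.100.4", Path: "/api/auth/login"}))
}
```

Checking the whitelist and blacklist before the AI call means the two cheapest lookups short-circuit the expensive one.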
2. Critical API Filtering#
Sending every request to the AI would be costly and slow, so only 7 core ticketing APIs are evaluated:
/api/queue/entry ← queue entry
/api/seat/select ← seat selection
/api/payment/process ← payment processing
/api/booking/confirm ← booking confirmation
/api/user/signup ← user registration
/api/auth/login ← login
/api/order/create ← order creation
All other paths bypass ext_authz and pass through directly.
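On the adapter side, the same path list could serve as a defensive second check. The sketch below assumes exact path matching after stripping the query string; whether the real filter matches exactly or by prefix is not stated here, and `criticalPaths` and `isCritical` are hypothetical names.

```go
package main

import (
	"fmt"
	"strings"
)

// criticalPaths is the set of seven Critical APIs listed above.
var criticalPaths = map[string]struct{}{
	"/api/queue/entry":     {},
	"/api/seat/select":     {},
	"/api/payment/process": {},
	"/api/booking/confirm": {},
	"/api/user/signup":     {},
	"/api/auth/login":      {},
	"/api/order/create":    {},
}

// isCritical reports whether rawPath (as found in the ":path" header,
// possibly with a query string) is one of the Critical APIs.
func isCritical(rawPath string) bool {
	path, _, _ := strings.Cut(rawPath, "?") // drop the query string, if any
	_, ok := criticalPaths[path]
	return ok
}

func main() {
	fmt.Println(isCritical("/api/seat/select?seat=A12"))
	fmt.Println(isCritical("/api/events/list"))
}
```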
3. IP Blacklist / Whitelist (Redis)#
Redis DB 3: CDN Edge blacklist (managed by Lambda, TTL 30 days)
Redis DB 5: Real-time blacklist (managed by Adapter, TTL 7 days)
- On a BLOCK verdict, the IP is registered in Redis DB 5 — subsequent requests are blocked immediately without calling the AI
- 22 team member IPs are whitelisted and always pass through
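The blacklist behavior (register on BLOCK, expire after the TTL) can be illustrated with an in-memory stand-in. This is only a sketch: the real adapter stores entries in Redis DB 5, and the `Blacklist` type and its methods are hypothetical names.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// realtimeTTL mirrors the 7-day TTL of the real-time blacklist (Redis DB 5).
const realtimeTTL = 7 * 24 * time.Hour

// Blacklist is an in-memory stand-in for the Redis-backed blacklist.
type Blacklist struct {
	mu      sync.Mutex
	ttl     time.Duration
	expires map[string]time.Time
}

func NewBlacklist(ttl time.Duration) *Blacklist {
	return &Blacklist{ttl: ttl, expires: make(map[string]time.Time)}
}

// Add registers ip, as the adapter does after a BLOCK verdict.
func (b *Blacklist) Add(ip string) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.expires[ip] = time.Now().Add(b.ttl)
}

// Contains reports whether ip is currently blacklisted.
func (b *Blacklist) Contains(ip string) bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	exp, ok := b.expires[ip]
	if !ok {
		return false
	}
	if !time.Now().Before(exp) {
		delete(b.expires, ip) // lazily drop expired entries
		return false
	}
	return true
}

func main() {
	bl := NewBlacklist(realtimeTTL)
	fmt.Println(bl.Contains("203.0.113.9")) // not yet registered
	bl.Add("203.0.113.9")
	fmt.Println(bl.Contains("203.0.113.9")) // registered, within TTL
}
```

With Redis, the expiry is handled by the key's TTL rather than by a check at read time.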
4. X-Bot-Token Validation#
The frontend (Next.js, Vercel) generates an X-Bot-Token via getBotToken() and includes it in request headers. The Adapter validates this token.
[Frontend (Next.js)]
→ generate X-Bot-Token via getBotToken()
→ request header: X-Bot-Token: <token>
↓
[Authz Adapter]
→ validate token
→ missing or invalid → 403
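The token format is not described in this post, so the following is purely illustrative. It assumes a hypothetical scheme in which the token is a Unix timestamp plus an HMAC-SHA256 signature over it; `sign`, `validate`, and the `<timestamp>.<hex signature>` layout are all assumptions, not what getBotToken() actually produces.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strconv"
	"strings"
	"time"
)

// sign builds a token of the assumed form "<unix-ts>.<hex hmac>".
func sign(secret []byte, ts int64) string {
	tsStr := strconv.FormatInt(ts, 10)
	mac := hmac.New(sha256.New, secret)
	mac.Write([]byte(tsStr))
	return tsStr + "." + hex.EncodeToString(mac.Sum(nil))
}

// validate rejects missing, malformed, stale, future-dated, or tampered tokens.
func validate(secret []byte, token string, nowUnix, maxAgeSec int64) bool {
	tsStr, _, found := strings.Cut(token, ".")
	if !found {
		return false // missing or malformed token
	}
	ts, err := strconv.ParseInt(tsStr, 10, 64)
	if err != nil {
		return false
	}
	if age := nowUnix - ts; age < 0 || age > maxAgeSec {
		return false // issued in the future or too old
	}
	// Recompute the expected token and compare in constant time.
	return hmac.Equal([]byte(token), []byte(sign(secret, ts)))
}

func main() {
	secret := []byte("demo-secret") // placeholder value for the example
	now := time.Now().Unix()
	token := sign(secret, now)
	fmt.Println(validate(secret, token, now, 60))
	fmt.Println(validate(secret, "", now, 60))
}
```

A rejected token maps to the 403 response shown in the flow above.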
5. Fail-Open Policy#
If the AI Defense API does not respond within the 800 ms timeout, the request is allowed through. This deliberately favors availability over security: a failure in the defense system during a ticket sale should never block legitimate users.
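A fail-open call can be sketched with a context deadline around the HTTP request. This is a minimal sketch under assumptions: the JSON response field `action`, the empty request payload, and the names `evaluate`, `evaluateDefault`, and `aiTimeout` are hypothetical. Every failure mode (timeout, connection error, non-200 status, unparsable body) yields the pass-through verdict.

```go
package main

import (
	"bytes"
	"context"
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

// aiTimeout matches the 800 ms budget described above.
const aiTimeout = 800 * time.Millisecond

// evalResponse assumes the verdict arrives in an "action" field.
type evalResponse struct {
	Action string `json:"action"`
}

// evaluate calls the AI Defense API and returns its verdict. On any failure
// it returns "NONE" with failedOpen set, so the caller lets the request pass.
func evaluate(ctx context.Context, client *http.Client, url string, payload []byte, timeout time.Duration) (action string, failedOpen bool) {
	ctx, cancel := context.WithTimeout(ctx, timeout)
	defer cancel()

	req, err := http.NewRequestWithContext(ctx, http.MethodPost, url, bytes.NewReader(payload))
	if err != nil {
		return "NONE", true
	}
	req.Header.Set("Content-Type", "application/json")

	resp, err := client.Do(req) // returns an error once the deadline passes
	if err != nil {
		return "NONE", true
	}
	defer resp.Body.Close()

	if resp.StatusCode != http.StatusOK {
		return "NONE", true
	}
	var out evalResponse
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil || out.Action == "" {
		return "NONE", true
	}
	return out.Action, false
}

// evaluateDefault wires in the default client, an empty payload, and aiTimeout.
func evaluateDefault(url string) (string, bool) {
	return evaluate(context.Background(), http.DefaultClient, url, []byte(`{}`), aiTimeout)
}

func main() {
	// Nothing is expected to listen on this port, so the call fails
	// and the request is allowed through.
	action, failedOpen := evaluateDefault("http://127.0.0.1:1/ai/evaluate")
	fmt.Println(action, failedOpen)
}
```

Returning the `failedOpen` flag separately lets the adapter count fail-open events in its metrics, which is useful for spotting an AI outage that would otherwise be invisible.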
AI Defense API (Python)#
A FastAPI-based bot detection engine built by the AI team. Below is the integration structure from an infrastructure perspective.
API Endpoints#
| Endpoint | Role |
|---|---|
| POST /ai/evaluate | Evaluate request → behavioral verdict (called by Adapter) |
| POST /ai/challenge/start | Issue VQA challenge token |
| POST /ai/challenge/verify | Verify challenge response |
| POST /ai/telemetry/ingest | Collect browser telemetry |
| POST /ai/precheck | Pre-validation before queue entry |
Verdict Actions#
| Action | HTTP Response | Description |
|---|---|---|
| NONE | 200 | Normal pass-through |
| CHALLENGE | 428 | VQA challenge (Catch Ball mini-game) required |
| THROTTLE | 200 + delay | T1: 200ms / T2: 1800ms delay |
| GATE | 429 | Temporary block |
| BLOCK | 403 | Block + IP blacklisted in Redis DB 5 (7-day TTL) |
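The table translates directly into a lookup from verdict to response. The sketch below is one way to express it in Go; the `outcome` type, the `mapVerdict` function, and the integer `tier` parameter are hypothetical, and treating unknown actions as pass-through is an assumption in the spirit of the fail-open policy.

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// outcome describes what the adapter does for a given verdict.
type outcome struct {
	Status    int           // HTTP status returned to the client
	Delay     time.Duration // artificial delay before passing the request on
	Blacklist bool          // whether the client IP is registered in the blacklist
}

// mapVerdict converts an AI verdict into an outcome. tier only matters for
// THROTTLE (1 = T1, 2 = T2).
func mapVerdict(action string, tier int) outcome {
	switch action {
	case "CHALLENGE":
		return outcome{Status: http.StatusPreconditionRequired} // 428
	case "THROTTLE":
		delay := 200 * time.Millisecond // T1
		if tier == 2 {
			delay = 1800 * time.Millisecond // T2
		}
		return outcome{Status: http.StatusOK, Delay: delay}
	case "GATE":
		return outcome{Status: http.StatusTooManyRequests} // 429
	case "BLOCK":
		return outcome{Status: http.StatusForbidden, Blacklist: true} // 403
	default:
		// NONE, and (by assumption) any unknown action, passes through.
		return outcome{Status: http.StatusOK}
	}
}

func main() {
	for _, action := range []string{"NONE", "CHALLENGE", "THROTTLE", "GATE", "BLOCK"} {
		fmt.Printf("%-9s %+v\n", action, mapVerdict(action, 2))
	}
}
```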
Session State Management#
Redis DB 0: Session state
- Per-user request history
- Accumulated behavior patterns
- Bot confidence score
Redis DB 5: Blacklist
- IPs with BLOCK verdict
- TTL 7 days
Istio Integration (EnvoyFilter)#
ext_authz Cluster Definition#
apiVersion: networking.istio.io/v1alpha3
kind: EnvoyFilter
metadata:
  name: ext-authz-cluster
  namespace: istio-system
spec:
  configPatches:
    - applyTo: CLUSTER
      patch:
        operation: ADD
        value:
          name: authz-adapter
          type: STRICT_DNS
          connect_timeout: 0.8s
          lb_policy: ROUND_ROBIN
          typed_extension_protocol_options:
            envoy.extensions.upstreams.http.v3.HttpProtocolOptions:
              "@type": type.googleapis.com/envoy.extensions.upstreams.http.v3.HttpProtocolOptions
              explicit_http_config:
                http2_protocol_options: {}
          load_assignment:
            cluster_name: authz-adapter
            endpoints:
              - lb_endpoints:
                  - endpoint:
                      address:
                        socket_address:
                          address: authz-adapter.staging-ai.svc.cluster.local
                          port_value: 9001
Lua Path Matching + ext_authz Trigger#
    - applyTo: HTTP_FILTER
      patch:
        operation: INSERT_BEFORE
        value:
          name: envoy.filters.http.lua
          typed_config:
            "@type": type.googleapis.com/envoy.extensions.filters.http.lua.v3.Lua
            inlineCode: |
              -- matches_critical_path is a helper defined elsewhere in this script (not shown)
              function envoy_on_request(request_handle)
                local path = request_handle:headers():get(":path")
                -- Trigger ext_authz only for Critical API paths
                if matches_critical_path(path) then
                  request_handle:headers():add("x-need-authz", "true")
                end
              end
Deployment Configuration#
Helm Values#
# values-authz-adapter.yaml
replicaCount: 2
image:
  repository: ghcr.io/goorm-gongbang/authz-adapter
ports:
  grpc: 9001
  metrics: 9090
env:
  AI_DEFENSE_URL: "http://ai-defense.staging-ai.svc.cluster.local:8000/ai/evaluate"
  AI_DEFENSE_TIMEOUT: "800ms"
  REDIS_ADDR: "redis.staging-data.svc.cluster.local:6379"
  OTEL_ENABLED: "true"
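On the Go side, these environment variables could be read at startup roughly as follows. This is a sketch, not the adapter's actual loader: the `Config` struct and `loadConfig` are hypothetical, and falling back to 800 ms when AI_DEFENSE_TIMEOUT is unset is an assumption. The value "800ms" is in the format that `time.ParseDuration` accepts.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"time"
)

// Config holds the settings passed in through the Helm env block.
type Config struct {
	AIDefenseURL     string
	AIDefenseTimeout time.Duration
	RedisAddr        string
	OTELEnabled      bool
}

// loadConfig reads settings through getenv, which makes the loader easy to
// exercise without touching the process environment.
func loadConfig(getenv func(string) string) (Config, error) {
	cfg := Config{
		AIDefenseURL:     getenv("AI_DEFENSE_URL"),
		AIDefenseTimeout: 800 * time.Millisecond, // assumed default
		RedisAddr:        getenv("REDIS_ADDR"),
	}
	if v := getenv("AI_DEFENSE_TIMEOUT"); v != "" {
		d, err := time.ParseDuration(v)
		if err != nil {
			return Config{}, fmt.Errorf("AI_DEFENSE_TIMEOUT: %w", err)
		}
		cfg.AIDefenseTimeout = d
	}
	if v := getenv("OTEL_ENABLED"); v != "" {
		b, err := strconv.ParseBool(v)
		if err != nil {
			return Config{}, fmt.Errorf("OTEL_ENABLED: %w", err)
		}
		cfg.OTELEnabled = b
	}
	return cfg, nil
}

func main() {
	cfg, err := loadConfig(os.Getenv)
	if err != nil {
		fmt.Println("invalid configuration:", err)
		os.Exit(1)
	}
	fmt.Printf("%+v\n", cfg)
}
```

Failing fast on a malformed duration at startup is preferable to silently running with a zero timeout, which would make every AI call fail open.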
Observability#
| Item | Configuration |
|---|---|
| Metrics | Prometheus :9090 (request count, latency, verdict distribution) |
| Traces | OpenTelemetry → Collector → Tempo |
| Logs | Structured JSON → Loki |
Summary#
| Component | Language | Role | Owner |
|---|---|---|---|
| Authz Adapter | Go | Envoy ↔ AI bridge, IP management, token validation | Infrastructure (me) |
| AI Defense API | Python | Behavioral analysis, bot verdict, challenge issuance | AI team |
| EnvoyFilter | YAML | ext_authz integration, path filtering | Infrastructure (me) |
| Helm Charts | YAML | Deployment config, per-environment values | Infrastructure (me) |
By leveraging Istio’s ext_authz extension point, AI bot detection is applied transparently at the mesh level without modifying any application code. The fail-open policy keeps requests flowing when the AI engine is slow or down, and evaluating only the seven Critical APIs limits the added latency to the paths that need it.
Reference Repos#
- 202-goormgb-authz-adapter — Go gRPC ext_authz adapter
- 201-goormgb-ai — AI Defense API (Python FastAPI)