## Project Overview
I designed and operated the full infrastructure for PlayBall, a baseball ticket booking platform. The infrastructure was shaped around five key requirements: fast dev environment setup, cost efficiency, operational observability, security, and handling traffic spikes.
## Environment Setup
| Environment | Runtime | Ingress | Data Layer | Purpose |
|---|---|---|---|---|
| Dev | kubeadm (2-node on-prem) | Cloudflare + Istio Gateway | PostgreSQL Pod, Redis Pod | Feature development, initial integration testing |
| Staging | AWS EKS (Multi-AZ, Spot) | CloudFront + ALB | RDS, ElastiCache | QA, load testing, security validation |
| Prod | AWS EKS (Multi-AZ, On-Demand) | CloudFront + ALB | RDS, ElastiCache | Live service |
## Tech Stack
| Area | Technology |
|---|---|
| Cloud | AWS EKS 1.35, RDS PostgreSQL 16, ElastiCache Redis 7, CloudFront, ALB, Route53, ACM |
| Container / Mesh | Kubernetes (kubeadm / EKS), Istio 1.29.1, Cilium (Dev CNI) |
| IaC | Terraform (stacks + environments separation) |
| CI/CD | TeamCity (build) → ECR (image) → ArgoCD (GitOps deploy) |
| Scaling | KEDA 2.19.0 (Cron Scaler + HPA), Karpenter 1.11.1 |
| Secrets / AuthN | External Secrets Operator, IRSA, AWS IAM Identity Center SSO |
| Observability | Prometheus, Loki, Tempo, Thanos, Grafana, OpenTelemetry Collector |
| Policy / Security | Kyverno, Policy Reporter, CloudTrail |
| Alerting | EventBridge + Lambda → Discord |
## Repository Structure (3-repo split)
Infrastructure provisioning, cluster bootstrapping, and declarative deployment are each managed in a separate repository.
| Repo | Responsibility | Key Contents |
|---|---|---|
| 301 Terraform | Provisioning | Declarative management of AWS resources (VPC, EKS, RDS, Redis, CDN, IAM) |
| 302 Bootstrap | Cluster initial setup | ESO, Karpenter, ArgoCD, Root App, one-time DB initialization |
| 303 Helm | GitOps deployment | Helm charts + ArgoCD Applications + continuous deployment via argocd-sync/* branches |
Load testing is handled in a separate repo (304-k6-operators) using distributed k6-operator tests.
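A distributed run in the load-testing repo can be sketched as a k6-operator `TestRun` resource. The resource name, ConfigMap name, script file, and parallelism below are illustrative, not the actual contents of 304-k6-operators:

```yaml
# Hypothetical k6-operator TestRun: fans the script out across 4 runner pods.
apiVersion: k6.io/v1alpha1
kind: TestRun
metadata:
  name: ticket-spike-test
spec:
  parallelism: 4            # number of distributed k6 runner pods
  script:
    configMap:
      name: ticket-spike-script   # ConfigMap holding the k6 script
      file: test.js
```

The k6-operator splits the configured VUs across the runner pods, so a single script can model the full spike without one pod becoming the load-generation bottleneck.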
## CI/CD Pipeline
An automated flow from source code merge through to cluster deployment.
```
Source code Push/Merge
  → TeamCity: Build & Test → ECR image push
  → Helm values image tag update → argocd-sync/* branch push
  → ArgoCD: Detect change → Auto-sync cluster (deploy)
```
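The "Helm values image tag update" step can be sketched as a small CI script. The file name, ECR repository path, and tag values below are illustrative, not taken from the actual TeamCity configuration:

```shell
#!/usr/bin/env sh
# Sketch of the CI step that bumps the image tag in the Helm values file
# before pushing to the environment's argocd-sync/* branch.
set -eu

ENV="prod"
NEW_TAG="1.4.2"                  # typically the CI build number or git SHA
VALUES_FILE="values-${ENV}.yaml"

# Example values file as it might exist in the Helm repo (303)
cat > "$VALUES_FILE" <<EOF
image:
  repository: 123456789012.dkr.ecr.ap-northeast-2.amazonaws.com/playball-api
  tag: 1.4.1
EOF

# Rewrite the tag in place; ArgoCD auto-syncs once the branch is pushed
sed -i.bak "s/^\(  tag: \).*/\1${NEW_TAG}/" "$VALUES_FILE"

# In CI this would be followed by something like:
#   git checkout -B "argocd-sync/${ENV}"
#   git commit -am "deploy ${NEW_TAG}" && git push origin "argocd-sync/${ENV}"
grep "tag:" "$VALUES_FILE"
```

Keeping the tag bump as a plain-text edit to a values file is what lets ArgoCD remain the only component with cluster credentials: CI only ever touches git.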
| Environment | Deploy Branch | Notes |
|---|---|---|
| Dev | argocd-sync/dev | |
| Staging | argocd-sync/staging → ca-staging | Split after SSO (IAM Identity Center) adoption |
| Prod | argocd-sync/prod → ca-prod | Split after SSO adoption |
## Monitoring Stack
Metrics, logs, and traces are unified in Grafana. Loki, Tempo, and Thanos data is stored in S3 object storage rather than on PVCs.
| Tool | Role |
|---|---|
| Prometheus + Thanos | Metrics collection + long-term retention (S3) |
| Loki | Log collection (S3 storage) |
| Tempo | Distributed tracing (S3 storage) |
| Grafana | Unified dashboard |
| OpenTelemetry Collector | App trace/metric collection |
| CloudTrail | AWS API audit trail |
| Policy Reporter | Kyverno policy violation visibility |
| Discord | EventBridge + Lambda alert delivery |
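The OpenTelemetry Collector's role in the table above can be sketched as a minimal pipeline config that receives OTLP from the apps and fans out to Tempo (traces) and Prometheus (metrics). The endpoints and namespace below are assumptions, not the deployed values:

```yaml
# Minimal OpenTelemetry Collector sketch: OTLP in, Tempo + Prometheus out.
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
exporters:
  otlp/tempo:
    endpoint: tempo-distributor.monitoring:4317   # hypothetical Tempo service
    tls:
      insecure: true
  prometheusremotewrite:
    endpoint: http://prometheus.monitoring:9090/api/v1/write
service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlp/tempo]
    metrics:
      receivers: [otlp]
      exporters: [prometheusremotewrite]
```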
## Ticketing Characteristics × Infrastructure Design
| Characteristic | Infrastructure Response |
|---|---|
| Traffic spikes | KEDA Cron Scaler pre-scaling + CPU/Memory HPA + Karpenter node scaling + CloudFront caching |
| Availability | Multi-AZ EKS/RDS/Redis + ArgoCD GitOps redeployment + RDS PITR + PDB |
| Security | 7-layer security framework + IAM Identity Center SSO + CloudTrail audit + penetration testing |
| Cost efficiency | Staging Spot + Graviton (ARM) instances + Loki/Tempo S3 lifecycle + Organizations consolidated billing |
| Operational observability | 3-signal unified observability (Metrics/Logs/Traces) + CloudTrail + Policy Reporter |
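The KEDA pre-scaling for traffic spikes can be sketched as a `ScaledObject` that combines a Cron trigger (scale out before ticket sales open) with a CPU trigger (absorb residual load). The deployment name, schedule, and replica counts are illustrative:

```yaml
# Hypothetical KEDA ScaledObject: pre-scale before a ticket-open window,
# then fall back to CPU-based autoscaling.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: ticket-api-prescale
spec:
  scaleTargetRef:
    name: ticket-api          # assumed Deployment name
  minReplicaCount: 2
  maxReplicaCount: 30
  triggers:
    - type: cron
      metadata:
        timezone: Asia/Seoul
        start: "50 9 * * *"   # scale up 10 minutes before a 10:00 open
        end: "30 10 * * *"
        desiredReplicas: "20"
    - type: cpu
      metricType: Utilization
      metadata:
        value: "70"
```

When the cron window holds replicas at 20, Karpenter sees the pending pods and provisions nodes ahead of the spike, so capacity is warm before the first request lands.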
## 7-Layer Security Framework
To defend against bot/macro attacks and L7 threats, security is structured into 7 layers following the traffic flow from client to application.
```
Client → ① Frontend → ② CDN/WAF → ③ ALB SG → ④ Istio WAF/RL → ⑤ ext_authz → ⑥ AI → ⑦ App
```
| Layer | Zone | Configuration |
|---|---|---|
| ① Frontend Security | X-Bot-Token + CSP | Cloudflare Turnstile token validation, CSP Report-Only |
| ② CDN + AWS WAF | CloudFront + WAF Rate-based Rule | DDoS protection, Rate-based Rules, Bot Control, IP Reputation |
| ③ ALB SG | HTTPS only + Security Group | CloudFront PL + specific team IPs only, blocks direct access |
| ④ Istio L7 Defense | EnvoyFilter WAF (Lua) + Rate Limiting | 10-pattern detection (header/path/body) + CDN X-Origin-Verify + Rate Limit (Global+Redis/Local) |
| ⑤ ext_authz | Authz Adapter (Go) | Real-time bot verdict gateway, Critical API filtering |
| ⑥ AI Defense | AI Behavioral Analysis Engine (Python) | Session behavior analysis + fingerprint + VQA challenge, auto-block |
| ⑦ App Security | Spring Gateway | JWT validation + Redis blacklist |
I designed and built layers ②–⑥ (the infrastructure security zone):
- ② CDN + WAF: CloudFront Distribution, WAF WebACL configuration (Terraform)
- ③ ALB SG: CloudFront Prefix List-based access restriction (Terraform)
- ④ Istio WAF/RL: EnvoyFilter Lua script, per-path Rate Limit design (Helm)
- ⑤ ext_authz: Go gRPC server development, Istio EnvoyFilter integration
- ⑥ AI Defense: Proposed Istio ext_authz approach to AI team, built deployment infrastructure (Helm/IRSA)
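The layer-④ `X-Origin-Verify` check can be sketched as an Istio `EnvoyFilter` that injects a Lua filter on the ingress gateway. The filter name, secret value, and single-pattern check below are a simplified stand-in for the actual 10-pattern Lua WAF:

```yaml
# Hypothetical EnvoyFilter: reject requests that bypass the CDN by
# validating the header CloudFront injects (X-Origin-Verify).
apiVersion: networking.istio.io/v1alpha3
kind: EnvoyFilter
metadata:
  name: lua-origin-verify
  namespace: istio-system
spec:
  workloadSelector:
    labels:
      istio: ingressgateway
  configPatches:
    - applyTo: HTTP_FILTER
      match:
        context: GATEWAY
        listener:
          filterChain:
            filter:
              name: envoy.filters.network.http_connection_manager
              subFilter:
                name: envoy.filters.http.router
      patch:
        operation: INSERT_BEFORE
        value:
          name: envoy.filters.http.lua
          typed_config:
            "@type": type.googleapis.com/envoy.extensions.filters.http.lua.v3.Lua
            inline_code: |
              function envoy_on_request(handle)
                -- drop traffic that did not come through CloudFront
                local verify = handle:headers():get("x-origin-verify")
                if verify ~= "expected-secret" then   -- real value comes from ESO
                  handle:respond({[":status"] = "403"}, "forbidden")
                end
              end
```

Running this at the gateway keeps layer ③'s Security Group restriction honest: even if the ALB is reached some other way, requests without the CDN-injected header never reach the mesh.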
## Key Achievements
- Designed and operated 3-environment infrastructure: Dev (on-prem kubeadm) / Staging (AWS EKS) / Prod (AWS EKS)
- Built GitOps CI/CD pipeline with TeamCity + ArgoCD
- Implemented unified observability with OpenTelemetry + Prometheus + Loki + Tempo + Thanos + Grafana
- Built 7-layer security framework using Istio EnvoyFilter (Lua WAF, Rate Limiting, ext_authz, mTLS STRICT)
- Established policy management with Kyverno + Policy Reporter
- Remediated infrastructure vulnerabilities identified during penetration testing