| Item | Details |
|---|---|
| Period | 2026.01 ~ 2026.04 |
| Type | Team Project (KTCloud TechUp Alliance) |
| Team size | 16 members |
| Role | DevOps / Infrastructure Engineer |
| Service | playball.one |
| GitHub | 330-playball-infra |
| Slides | Presentation Deck (Google Drive) |
| Troubleshooting | Full list |

Project Overview#

I designed and operated the full infrastructure for PlayBall, a baseball ticket booking platform. The infrastructure was shaped around five key requirements: fast dev environment setup, cost efficiency, operational observability, security, and handling traffic spikes.


Environment Setup#

| Environment | Runtime | Ingress | Data Layer | Purpose |
|---|---|---|---|---|
| Dev | kubeadm (2-node on-prem) | Cloudflare + Istio Gateway | PostgreSQL Pod, Redis Pod | Feature development, initial integration testing |
| Staging | AWS EKS (Multi-AZ, Spot) | CloudFront + ALB | RDS, ElastiCache | QA, load testing, security validation |
| Prod | AWS EKS (Multi-AZ, On-Demand) | CloudFront + ALB | RDS, ElastiCache | Live service |

Tech Stack#

| Area | Technology |
|---|---|
| Cloud | AWS EKS 1.35, RDS PostgreSQL 16, ElastiCache Redis 7, CloudFront, ALB, Route53, ACM |
| Container / Mesh | Kubernetes (kubeadm / EKS), Istio 1.29.1, Cilium (Dev CNI) |
| IaC | Terraform (stacks + environments separation) |
| CI/CD | TeamCity (build) → ECR (image) → ArgoCD (GitOps deploy) |
| Scaling | KEDA 2.19.0 (Cron Scaler + HPA), Karpenter 1.11.1 |
| Secrets / AuthN | External Secrets Operator, IRSA, AWS IAM Identity Center SSO |
| Observability | Prometheus, Loki, Tempo, Thanos, Grafana, OpenTelemetry Collector |
| Policy / Security | Kyverno, Policy Reporter, CloudTrail |
| Alerting | EventBridge + Lambda → Discord |

Repository Structure (3-repo split)#

Infrastructure provisioning, cluster bootstrapping, and declarative deployment each live in their own repository and are managed independently.

| Repo | Responsibility | Key Contents |
|---|---|---|
| 301 Terraform | Provisioning | Declarative management of AWS resources (VPC, EKS, RDS, Redis, CDN, IAM) |
| 302 Bootstrap | Cluster initial setup | ESO, Karpenter, ArgoCD, Root App, one-time DB initialization |
| 303 Helm | GitOps deployment | Helm charts + ArgoCD Applications + continuous deployment via argocd-sync/* branches |

Load testing is handled in a separate repo (304-k6-operators), which runs distributed tests with k6-operator.


CI/CD Pipeline#

The pipeline automates the flow from source code merge through to cluster deployment.

Source code Push/Merge
  → TeamCity: Build & Test → ECR image push
    → Helm values image tag update → argocd-sync/* branch push
      → ArgoCD: Detect change → Auto-sync cluster (deploy)

| Environment | Deploy Branch | Notes |
|---|---|---|
| Dev | argocd-sync/dev | |
| Staging | argocd-sync/staging → ca-staging | Split after SSO (IAM Identity Center) adoption |
| Prod | argocd-sync/prod → ca-prod | Split after SSO adoption |
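
The "Helm values image tag update" step in the flow above can be illustrated with a minimal sketch. It assumes a conventional `image: {repository, tag}` layout in the values file; the function names `bump_image_tag` and `deploy_branch` are hypothetical and are not taken from the actual TeamCity configuration. The later `ca-*` branch names are not modelled here.

```python
import re


def bump_image_tag(values_yaml: str, new_tag: str) -> str:
    """Replace the first `tag:` entry in a Helm values file with new_tag.

    Assumption: the file has a single image block whose tag sits on its own
    line. Indentation and every other line are left untouched.
    """
    pattern = re.compile(r"^([ \t]*tag:[ \t]*).*$", re.MULTILINE)
    updated, count = pattern.subn(
        lambda m: f'{m.group(1)}"{new_tag}"', values_yaml, count=1
    )
    if count == 0:
        raise ValueError("no `tag:` key found in values file")
    return updated


def deploy_branch(environment: str) -> str:
    """Map an environment name onto the argocd-sync/* branch pattern."""
    return f"argocd-sync/{environment}"


if __name__ == "__main__":
    values = 'image:\n  repository: example/api\n  tag: "abc123"\nreplicas: 2\n'
    print(bump_image_tag(values, "def456"))
    print(deploy_branch("dev"))
```

Committing the rewritten values file to the deploy branch is what ArgoCD then detects and syncs, so the cluster state always follows Git rather than the build server.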

Monitoring Stack#

Metrics, logs, and traces are unified in Grafana. Loki, Tempo, and Thanos persist their data to S3 object storage rather than to PVCs.

| Tool | Role |
|---|---|
| Prometheus + Thanos | Metrics collection + long-term retention (S3) |
| Loki | Log collection (S3 storage) |
| Tempo | Distributed tracing (S3 storage) |
| Grafana | Unified dashboard |
| OpenTelemetry Collector | App trace/metric collection |
| CloudTrail | AWS API audit trail |
| Policy Reporter | Kyverno policy violation visibility |
| Discord | EventBridge + Lambda alert delivery |
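
The EventBridge + Lambda → Discord path can be sketched roughly as follows. This is an illustration under assumptions, not the deployed function: the environment variable `DISCORD_WEBHOOK_URL`, the `User-Agent` value, and the message layout are all hypothetical. It relies only on the standard EventBridge envelope fields (`detail-type`, `source`, `region`, `detail`) and on Discord webhooks accepting a JSON body with a `content` field.

```python
import json
import os
import urllib.request


def to_discord_payload(event: dict) -> dict:
    """Turn an EventBridge event into a Discord webhook message body."""
    detail_type = event.get("detail-type", "unknown event")
    source = event.get("source", "unknown source")
    region = event.get("region", "unknown region")
    detail = json.dumps(event.get("detail", {}), indent=2, sort_keys=True)
    # Discord limits message content to 2000 characters, so truncate the detail.
    return {"content": f"[{source}] {detail_type} ({region})\n{detail[:1500]}"}


def handler(event, context, webhook_url=None):
    """Lambda entry point: forward the formatted event to the webhook."""
    url = webhook_url or os.environ["DISCORD_WEBHOOK_URL"]  # hypothetical variable name
    request = urllib.request.Request(
        url,
        data=json.dumps(to_discord_payload(event)).encode("utf-8"),
        # An explicit User-Agent is set because some endpoints reject urllib's default.
        headers={"Content-Type": "application/json", "User-Agent": "playball-alerts/0.1"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=5) as response:
        return {"status": response.status}
```

Keeping the formatting in a pure function separate from the network call makes the message layout testable without a live webhook.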

Ticketing Characteristics × Infrastructure Design#

| Characteristic | Infrastructure Response |
|---|---|
| Traffic spikes | KEDA Cron Scaler pre-scaling + CPU/Memory HPA + Karpenter node scaling + CloudFront caching |
| Availability | Multi-AZ EKS/RDS/Redis + ArgoCD GitOps redeployment + RDS PITR + PDB |
| Security | 7-layer security framework + IAM Identity Center SSO + CloudTrail audit + penetration testing |
| Cost efficiency | Staging Spot + Graviton (ARM) instances + Loki/Tempo S3 lifecycle + Organizations consolidated billing |
| Operational observability | 3-signal unified observability (Metrics/Logs/Traces) + CloudTrail + Policy Reporter |
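
To show how cron pre-scaling and CPU-based HPA combine for a ticket-opening spike, here is a simplified model. It assumes the usual HPA rule, desired = ceil(current replicas × current metric / target metric), and that the largest proposal across triggers wins. The window times, replica counts, and function names are hypothetical, and real HPA behaviour (tolerance, stabilization windows) is deliberately left out.

```python
import math
from datetime import time


def hpa_desired(current_replicas: int, current_value: float, target_value: float) -> int:
    """Standard HPA formula: ceil(current * current_metric / target_metric)."""
    return math.ceil(current_replicas * current_value / target_value)


def cron_desired(now: time, windows: list[tuple[time, time, int]]) -> int:
    """Replicas requested by the cron windows active at `now` (0 when none is)."""
    return max((replicas for start, end, replicas in windows if start <= now < end), default=0)


def desired_replicas(now, windows, current_replicas, cpu_util, cpu_target,
                     min_replicas, max_replicas):
    """Take the largest proposal, then clamp it to the configured bounds."""
    wanted = max(cron_desired(now, windows),
                 hpa_desired(current_replicas, cpu_util, cpu_target))
    return max(min_replicas, min(max_replicas, wanted))


if __name__ == "__main__":
    # Hypothetical pre-scale window around an 11:00 ticket opening.
    windows = [(time(10, 30), time(11, 30), 20)]
    print(desired_replicas(time(10, 45), windows, 4, 30.0, 60.0, 2, 30))
    print(desired_replicas(time(12, 0), windows, 4, 90.0, 60.0, 2, 30))
```

The point of the cron window is that capacity is already in place before the first request arrives, whereas the CPU trigger can only react after load has risen.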

7-Layer Security Framework#

To defend against bot/macro attacks and L7 threats, security is structured into 7 layers following the traffic flow from client to application.

Client → ① Frontend → ② CDN/WAF → ③ ALB SG → ④ Istio WAF/RL → ⑤ ext_authz → ⑥ AI → ⑦ App

| Layer | Zone | Configuration |
|---|---|---|
| ① Frontend Security | X-Bot-Token + CSP | Cloudflare Turnstile token validation, CSP Report-Only |
| ② CDN + AWS WAF | CloudFront + WAF Rate-based Rule | DDoS protection, Rate-based Rules, Bot Control, IP Reputation |
| ③ ALB SG | HTTPS only + Security Group | CloudFront PL + specific team IPs only, blocks direct access |
| ④ Istio L7 Defense | EnvoyFilter WAF (Lua) + Rate Limiting | 10-pattern detection (header/path/body) + CDN X-Origin-Verify + Rate Limit (Global+Redis/Local) |
| ⑤ ext_authz | Authz Adapter (Go) | Real-time bot verdict gateway, Critical API filtering |
| ⑥ AI Defense | AI Behavioral Analysis Engine (Python) | Session behavior analysis + fingerprint + VQA challenge, auto-block |
| ⑦ App Security | Spring Gateway | JWT validation + Redis blacklist |
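
The logic of layer ④ (origin verification followed by pattern detection over path, headers, and body) can be mirrored in a short sketch. The production filter is a Lua script inside an EnvoyFilter; this Python version only illustrates the control flow. The patterns below are examples and are not the project's actual 10 patterns, and `inspect_request` is a hypothetical name. Header names are assumed to arrive lowercased.

```python
import hmac
import re

# Illustrative patterns only; the real rule set is not listed in this document.
SUSPICIOUS = {
    "path": [re.compile(r"\.\./"), re.compile(r"(?i)/\.env$")],
    "header": [re.compile(r"(?i)sqlmap|nikto")],
    "body": [re.compile(r"(?i)<script\b"), re.compile(r"(?i)\bunion\s+select\b")],
}


def inspect_request(path: str, headers: dict, body: str, origin_secret: str):
    """Return (status, reason): 403 on the first failed check, 200 otherwise."""
    # Requests that bypassed the CDN will not carry the shared X-Origin-Verify value.
    supplied = headers.get("x-origin-verify", "")
    if not hmac.compare_digest(supplied.encode(), origin_secret.encode()):
        return 403, "missing or wrong X-Origin-Verify"

    header_blob = "\n".join(f"{name}: {value}" for name, value in headers.items())
    for zone, text in (("path", path), ("header", header_blob), ("body", body)):
        for pattern in SUSPICIOUS[zone]:
            if pattern.search(text):
                return 403, f"{zone} matched {pattern.pattern}"
    return 200, "ok"
```

Checking the origin header first means traffic that skipped CloudFront is rejected before any pattern matching is spent on it.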

I designed and built layers ②–⑥ (the infrastructure security zone):

  • ② CDN + WAF: CloudFront Distribution, WAF WebACL configuration (Terraform)
  • ③ ALB SG: CloudFront Prefix List-based access restriction (Terraform)
  • ④ Istio WAF/RL: EnvoyFilter Lua script, per-path Rate Limit design (Helm)
  • ⑤ ext_authz: Go gRPC server development, Istio EnvoyFilter integration
  • ⑥ AI Defense: Proposed Istio ext_authz approach to AI team, built deployment infrastructure (Helm/IRSA)
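
The per-path rate limit idea in layer ④ can be sketched with a token bucket, the algorithm Envoy's local rate limiting is built on. This is an in-memory illustration comparable to the Local mode only; the Global mode backed by Redis is not modelled. The paths, capacities, and refill rates are hypothetical, not the project's real values.

```python
from dataclasses import dataclass, field


@dataclass
class TokenBucket:
    capacity: int          # maximum burst size
    refill_per_sec: float  # sustained request rate
    tokens: float = field(init=False)
    updated_at: float = field(init=False, default=0.0)

    def __post_init__(self):
        self.tokens = float(self.capacity)  # start with a full bucket

    def allow(self, now: float) -> bool:
        """Refill for the elapsed time, then spend one token if available."""
        elapsed = max(0.0, now - self.updated_at)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self.updated_at = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


# Hypothetical limits: path prefix -> (capacity, refill per second).
PATH_LIMITS = {"/api/reservations": (5, 1.0), "/api/": (50, 20.0)}


class PathRateLimiter:
    def __init__(self, limits: dict):
        # Longest prefix first so the most specific rule wins.
        self.rules = sorted(limits.items(), key=lambda item: len(item[0]), reverse=True)
        self.buckets = {}

    def allow(self, client_id: str, path: str, now: float) -> bool:
        for prefix, (capacity, rate) in self.rules:
            if path.startswith(prefix):
                key = (client_id, prefix)
                if key not in self.buckets:
                    self.buckets[key] = TokenBucket(capacity, rate)
                return self.buckets[key].allow(now)
        return True  # no rule applies to this path
```

Giving the booking path a much smaller bucket than general API traffic throttles macro-driven reservation attempts without slowing ordinary browsing.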

Key Achievements#

  • Designed and operated 3-environment infrastructure: Dev (on-prem kubeadm) / Staging (AWS EKS) / Prod (AWS EKS)
  • Built GitOps CI/CD pipeline with TeamCity + ArgoCD
  • Implemented unified observability with OpenTelemetry + Prometheus + Loki + Tempo + Thanos + Grafana
  • Built 7-layer security framework using Istio EnvoyFilter (Lua WAF, Rate Limiting, ext_authz, mTLS STRICT)
  • Established policy management with Kyverno + Policy Reporter
  • Remediated the infrastructure vulnerabilities identified during penetration testing