Problem#

When Spot instances were reclaimed in the Staging environment and new nodes were provisioned in a different AZ, monitoring pods (Loki, Tempo, etc.) got stuck in Pending because they couldn’t mount their existing PVCs.

Symptoms#

  • Spot node reclaimed → new node provisioned in a different AZ (e.g., ap-northeast-2a → 2c)
  • Monitoring pods rescheduled → existing PVC (EBS) was bound to the previous AZ, mount failed
  • Pod stuck in Pending; events show Warning: FailedAttachVolume
  • Log and trace collection interrupted

Root cause: EBS is bound to a single AZ#

[Before]
  Node (ap-northeast-2a) ← Spot
  └── Loki Pod
      └── PVC → EBS (pinned to ap-northeast-2a)

[After Spot reclaim]
  Node (ap-northeast-2c) ← newly provisioned
  └── Loki Pod (Pending)
      └── PVC → EBS (ap-northeast-2a) ← cannot be mounted!

An EBS volume can only be attached to instances in the AZ where it was created. Because replacement Spot instances can be provisioned in any available AZ, AZ mismatches occurred frequently.
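
The pinning is visible on the PersistentVolume object itself: the EBS CSI driver stamps each dynamically provisioned PV with a nodeAffinity for the volume's zone, so the scheduler can never place the pod anywhere else. A minimal sketch of such a PV (names and IDs are illustrative, not from the actual cluster):

# Sketch: what the EBS CSI driver records on a dynamically provisioned PV
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pvc-loki-0               # hypothetical name
spec:
  capacity:
    storage: 100Gi
  accessModes:
    - ReadWriteOnce
  csi:
    driver: ebs.csi.aws.com
    volumeHandle: vol-0abc123    # hypothetical EBS volume ID
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: topology.ebs.csi.aws.com/zone
              operator: In
              values:
                - ap-northeast-2a   # pod can only be scheduled in this AZ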

Additional problems with PVC (EBS)#

| Problem | Description |
| --- | --- |
| AZ lock-in | EBS bound to a single AZ → mount failure on Spot reclaim |
| Cost | gp3 100GB = ~$8/month; disk expansion needed as data grows |
| Capacity limit | Full disk causes Pod CrashLoop; manual expansion required |
| Data durability | Complex recovery on Pod/node failure |

Fix#

Redesigned to use S3 as the persistent storage backend for logs (Loki), traces (Tempo), and long-term metrics (Thanos).

S3 vs PVC comparison#

| Item | PVC (EBS) | S3 |
| --- | --- | --- |
| Cost | gp3 100GB = ~$8/month | 100GB = ~$2.3/month |
| Capacity | Fixed; manual expansion | Virtually unlimited |
| AZ constraint | Bound to a single AZ | Accessible from any AZ in the region |
| Spot compatibility | AZ mismatch risk | No impact |
| Lifecycle | Manual management | Automatic (Glacier transition, expiration) |
| Durability | 99.8–99.9% (gp3) | 99.999999999% (11 nines) |

Per-tool S3 storage architecture#

1. Loki (logs)#

[App Pod] → stdout
    → [Promtail/Alloy] → log collection
        → [Loki Ingester] → buffer chunks in memory (WAL)
            → threshold reached → flush chunks to S3
  • Ingester accumulates logs in memory, then flushes in bulk to S3
  • PVC used minimally (WAL only) or not at all
  • Queries read directly from S3 (via the Querier component)
# Loki S3 configuration
loki:
  storage:
    type: s3
    s3:
      region: ap-northeast-2
      bucketnames: goormgb-loki-chunks
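
For the Querier to read directly from S3, the index schema must point at object storage as well, not just the chunks. A hypothetical companion excerpt in the same values style (the from-date and schema version are assumed, not the actual Staging settings):

# Hypothetical schema excerpt: the index is also shipped to object storage
loki:
  schemaConfig:
    configs:
      - from: "2024-01-01"   # assumed start date
        store: tsdb          # TSDB index, shipped to the same object store
        object_store: s3
        schema: v13          # assumed schema version
        index:
          prefix: index_
          period: 24h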

2. Tempo (traces)#

[App Pod] → OTLP (gRPC:4317)
    → [OTel Collector] → trace collection
        → [Tempo Ingester] → buffer blocks in memory (WAL)
            → threshold reached → flush blocks to S3
  • Nearly the same pattern as Loki
  • Trace data stored in S3 in block units
  • Queries read from S3
# Tempo S3 configuration
tempo:
  storage:
    trace:
      backend: s3
      s3:
        bucket: goormgb-tempo-traces
        region: ap-northeast-2
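
The "(WAL)" in the flow above still needs a small local path even with the S3 backend; only completed blocks leave the node. A hedged excerpt (the path is Tempo's conventional default, assumed here):

# Hypothetical excerpt: WAL stays on local disk; only flushed blocks go to S3
tempo:
  storage:
    trace:
      backend: s3
      wal:
        path: /var/tempo/wal   # small local/ephemeral volume suffices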

3. Thanos (long-term metrics)#

Thanos follows a different pattern from Loki/Tempo. Prometheus continues to use a PVC, and the Thanos Sidecar copies TSDB blocks to S3.

[Prometheus] → local TSDB (PVC, retains recent data for a few days)
    → [Thanos Sidecar] → uploads TSDB blocks to S3 every 2 hours

Query path:
    [Thanos Query]
        → recent data: query Prometheus directly (via Sidecar)
        → historical data: query S3 (via Thanos Store Gateway)
  • Prometheus still uses a PVC (for recent data and WAL)
  • Thanos Sidecar uploads completed TSDB blocks to S3
  • Historical queries are served by Thanos Store Gateway reading from S3
# Thanos S3 configuration (Prometheus values)
prometheus:
  prometheusSpec:
    thanos:
      objectStorageConfig:
        secret:
          type: S3
          config:
            bucket: goormgb-thanos-metrics
            region: ap-northeast-2
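
Because the Sidecar offloads completed blocks, the local PVC only needs to hold the WAL plus a few days of recent data. A sketch of the matching retention settings in the same values file (both values are assumptions, not the actual Staging config):

# Hypothetical companion settings: short local retention once blocks land in S3
prometheus:
  prometheusSpec:
    retention: 5d                 # assumed; keep only recent data locally
    storageSpec:
      volumeClaimTemplate:
        spec:
          resources:
            requests:
              storage: 20Gi       # assumed; small PVC for WAL + recent blocks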

Overall data flow#

                        ┌─── S3: goormgb-loki-chunks ──────┐
[Loki Ingester]    ───▶ │  log chunks                      │
                        │  Lifecycle: 30d → Glacier        │
                        │              400d → delete       │
                        └──────────────────────────────────┘

                        ┌─── S3: goormgb-tempo-traces ─────┐
[Tempo Ingester]   ───▶ │  trace blocks                    │
                        │  Lifecycle: 30d → Glacier        │
                        │              400d → delete       │
                        └──────────────────────────────────┘

                        ┌─── S3: goormgb-thanos-metrics ───┐
[Thanos Sidecar]   ───▶ │  TSDB blocks (2h intervals)      │
                        │  Lifecycle: 30d → Glacier        │
                        │              400d → delete       │
                        └──────────────────────────────────┘

S3 Lifecycle Policy#

S3 Lifecycle policies applied for cost optimization:

| Period | Storage class | Cost (per 100GB) |
| --- | --- | --- |
| 0–30 days | S3 Standard | ~$2.3/month |
| 30–400 days | S3 Glacier | ~$0.4/month |
| After 400 days | Deleted | $0 |

Recent data remains fast to query, while older data automatically transitions to Glacier to reduce costs.
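
Expressed as infrastructure code, this is a single lifecycle rule per bucket. A sketch in CloudFormation YAML (resource naming is illustrative; the bucket matches the Loki example above):

# Sketch: one lifecycle rule per bucket (CloudFormation YAML, names illustrative)
Resources:
  LokiChunksBucket:
    Type: AWS::S3::Bucket
    Properties:
      BucketName: goormgb-loki-chunks
      LifecycleConfiguration:
        Rules:
          - Id: archive-then-expire
            Status: Enabled
            Transitions:
              - StorageClass: GLACIER    # 30d → Glacier
                TransitionInDays: 30
            ExpirationInDays: 400        # 400d → delete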


S3 Access: IRSA#

Pods access S3 using IRSA (IAM Roles for Service Accounts) rather than static AWS access keys.

[Loki Pod]
    → ServiceAccount: loki-sa
        → annotation: eks.amazonaws.com/role-arn: arn:aws:iam::<ACCOUNT_ID>:role/loki-irsa
            → IAM Role: S3 PutObject/GetObject permissions only
  • Temporary credentials issued via OIDC, no static keys
  • Least-privilege access per pod
  • No key rotation required
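
In manifest form, the binding is just an annotation on the ServiceAccount; the EKS OIDC provider and pod identity webhook handle credential injection. A sketch (account ID, role name, and namespace are assumptions):

# Sketch: ServiceAccount wired to an IAM role via IRSA (account ID assumed)
apiVersion: v1
kind: ServiceAccount
metadata:
  name: loki-sa
  namespace: monitoring        # assumed namespace
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::<ACCOUNT_ID>:role/loki-irsa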

Summary#

| Tool | Local storage (PVC) | S3 storage | Write pattern |
| --- | --- | --- | --- |
| Loki | WAL only (minimal) | All log chunks | Ingester flushes directly to S3 |
| Tempo | WAL only (minimal) | All trace blocks | Ingester flushes directly to S3 |
| Thanos | Prometheus PVC (recent data) | TSDB block copies | Sidecar uploads every 2 hours |

Key reasons for choosing S3-backed storage:

  1. Avoids PVC AZ mismatch on Staging Spot instances
  2. Eliminates manual disk expansion as data grows
  3. ~70% cost reduction compared to EBS (with Lifecycle policies)
  4. 11 nines durability for data safety