Problem#

When Spot instances were reclaimed in the Staging environment and new nodes were provisioned in a different AZ, monitoring pods (Loki, Tempo, etc.) got stuck in Pending because they couldn’t mount their existing PVCs.

Symptoms#

  • Spot node reclaimed → new node provisioned in a different AZ (e.g., ap-northeast-2a → 2c)
  • Monitoring pods rescheduled → existing PVC (EBS) was bound to the previous AZ, mount failed
  • Pod stuck in Pending; events show Warning: FailedAttachVolume
  • Log and trace collection interrupted

Root cause: EBS is bound to a single AZ#

[Before]
  Node (ap-northeast-2a) ← Spot
  └── Loki Pod
      └── PVC → EBS (pinned to ap-northeast-2a)

[After Spot reclaim]
  Node (ap-northeast-2c) ← newly provisioned
  └── Loki Pod (Pending)
      └── PVC → EBS (ap-northeast-2a) ← cannot be mounted!

An EBS volume can only be attached to instances in the AZ where it was created. Because replacement Spot instances can be provisioned in any available AZ, AZ mismatches occurred frequently.
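
The pinning is visible on the PersistentVolume object itself: the EBS CSI driver stamps each dynamically provisioned PV with a nodeAffinity for the volume's zone, so the scheduler can never place the pod anywhere else. A minimal sketch of such a PV (names and IDs are illustrative, not from the actual cluster):

# Sketch: what the EBS CSI driver records on a dynamically provisioned PV
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pvc-loki-0               # hypothetical name
spec:
  capacity:
    storage: 100Gi
  accessModes:
    - ReadWriteOnce
  csi:
    driver: ebs.csi.aws.com
    volumeHandle: vol-0abc123    # hypothetical EBS volume ID
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: topology.ebs.csi.aws.com/zone
              operator: In
              values:
                - ap-northeast-2a   # pod can only be scheduled in this AZ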

Additional problems with PVC (EBS)#

| Problem | Description |
| --- | --- |
| AZ lock-in | EBS bound to a single AZ → mount failure on Spot reclaim |
| Cost | gp3 100GB = ~$8/month; disk expansion needed as data grows |
| Capacity limit | Full disk causes Pod CrashLoop; manual expansion required |
| Data durability | Complex recovery on Pod/node failure |

Fix#

Redesigned to use S3 as the persistent storage backend for logs (Loki), traces (Tempo), and long-term metrics (Thanos).

S3 vs PVC comparison#

| Item | PVC (EBS) | S3 |
| --- | --- | --- |
| Cost | gp3 100GB = ~$8/month | 100GB = ~$2.3/month |
| Capacity | Fixed; manual expansion | Virtually unlimited |
| AZ constraint | Bound to a single AZ | Accessible from any AZ in the region |
| Spot compatibility | AZ mismatch risk | No impact |
| Lifecycle | Manual management | Automatic (Glacier transition, expiration) |
| Durability | 99.8–99.9% (gp3) | 99.999999999% (11 nines) |

Per-tool S3 storage architecture#

1. Loki (logs)#

[App Pod] → stdout
    → [Promtail/Alloy] → log collection
        → [Loki Ingester] → buffer chunks in memory (WAL)
            → threshold reached → flush chunks to S3
  • Ingester accumulates logs in memory, then flushes in bulk to S3
  • PVC used minimally (WAL only) or not at all
  • Queries read directly from S3 (via the Querier component)
# Loki S3 configuration
loki:
  storage:
    type: s3
    s3:
      region: ap-northeast-2
      bucketnames: goormgb-loki-chunks
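
For the Querier to read directly from S3, the index schema must point at object storage as well, not just the chunks. A hypothetical companion excerpt in the same values style (the from-date and schema version are assumed, not the actual Staging settings):

# Hypothetical schema excerpt: the index is also shipped to object storage
loki:
  schemaConfig:
    configs:
      - from: "2024-01-01"   # assumed start date
        store: tsdb          # TSDB index, shipped to the same object store
        object_store: s3
        schema: v13          # assumed schema version
        index:
          prefix: index_
          period: 24h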

2. Tempo (traces)#

[App Pod] → OTLP (gRPC:4317)
    → [OTel Collector] → trace collection
        → [Tempo Ingester] → buffer blocks in memory (WAL)
            → threshold reached → flush blocks to S3
  • Nearly the same pattern as Loki
  • Trace data stored in S3 in block units
  • Queries read from S3
# Tempo S3 configuration
tempo:
  storage:
    trace:
      backend: s3
      s3:
        bucket: goormgb-tempo-traces
        region: ap-northeast-2
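
The "(WAL)" in the flow above still needs a small local path even with the S3 backend; only completed blocks leave the node. A hedged excerpt (the path is Tempo's conventional default, assumed here):

# Hypothetical excerpt: WAL stays on local disk; only flushed blocks go to S3
tempo:
  storage:
    trace:
      backend: s3
      wal:
        path: /var/tempo/wal   # small local/ephemeral volume suffices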

3. Thanos (long-term metrics)#

Thanos follows a different pattern from Loki/Tempo. Prometheus continues to use a PVC, and the Thanos Sidecar copies TSDB blocks to S3.

[Prometheus] → local TSDB (PVC, retains recent data for a few days)
    → [Thanos Sidecar] → uploads TSDB blocks to S3 every 2 hours

Query path:
    [Thanos Query]
        → recent data: query Prometheus directly (via Sidecar)
        → historical data: query S3 (via Thanos Store Gateway)
  • Prometheus still uses a PVC (for recent data and WAL)
  • Thanos Sidecar uploads completed TSDB blocks to S3
  • Historical queries are served by Thanos Store Gateway reading from S3
# Thanos S3 configuration (Prometheus values)
prometheus:
  prometheusSpec:
    thanos:
      objectStorageConfig:
        secret:
          type: S3
          config:
            bucket: goormgb-thanos-metrics
            region: ap-northeast-2
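
Because the Sidecar offloads completed blocks, the local PVC only needs to hold the WAL plus a few days of recent data. A sketch of the matching retention settings in the same values file (both values are assumptions, not the actual Staging config):

# Hypothetical companion settings: short local retention once blocks land in S3
prometheus:
  prometheusSpec:
    retention: 5d                 # assumed; keep only recent data locally
    storageSpec:
      volumeClaimTemplate:
        spec:
          resources:
            requests:
              storage: 20Gi       # assumed; small PVC for WAL + recent blocks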

Overall data flow#

                        ┌─── S3: goormgb-loki-chunks ──────┐
[Loki Ingester]    ───▶ │  log chunks                      │
                        │  Lifecycle: 30d → Glacier        │
                        │              400d → delete       │
                        └──────────────────────────────────┘

                        ┌─── S3: goormgb-tempo-traces ─────┐
[Tempo Ingester]   ───▶ │  trace blocks                    │
                        │  Lifecycle: 30d → Glacier        │
                        │              400d → delete       │
                        └──────────────────────────────────┘

                        ┌─── S3: goormgb-thanos-metrics ───┐
[Thanos Sidecar]   ───▶ │  TSDB blocks (2h intervals)      │
                        │  Lifecycle: 30d → Glacier        │
                        │              400d → delete       │
                        └──────────────────────────────────┘

S3 Lifecycle Policy#

S3 Lifecycle policies applied for cost optimization:

| Period | Storage class | Cost (per 100GB) |
| --- | --- | --- |
| 0–30 days | S3 Standard | ~$2.3/month |
| 30–400 days | S3 Glacier | ~$0.4/month |
| After 400 days | Deleted | $0 |

Recent data remains fast to query, while older data automatically transitions to Glacier to reduce costs.
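
Expressed as infrastructure code, this is a single lifecycle rule per bucket. A sketch in CloudFormation YAML (resource naming is illustrative; the bucket matches the Loki example above):

# Sketch: one lifecycle rule per bucket (CloudFormation YAML, names illustrative)
Resources:
  LokiChunksBucket:
    Type: AWS::S3::Bucket
    Properties:
      BucketName: goormgb-loki-chunks
      LifecycleConfiguration:
        Rules:
          - Id: archive-then-expire
            Status: Enabled
            Transitions:
              - StorageClass: GLACIER    # 30d → Glacier
                TransitionInDays: 30
            ExpirationInDays: 400        # 400d → delete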


S3 Access: IRSA#

Pods access S3 using IRSA (IAM Roles for Service Accounts) rather than static AWS access keys.

[Loki Pod]
    → ServiceAccount: loki-sa
        → annotation: eks.amazonaws.com/role-arn: arn:aws:iam::<ACCOUNT_ID>:role/loki-irsa
            → IAM Role: S3 PutObject/GetObject permissions only
  • Temporary credentials issued via OIDC, no static keys
  • Least-privilege access per pod
  • No key rotation required
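
In manifest form, the binding is just an annotation on the ServiceAccount; the EKS OIDC provider and pod identity webhook handle credential injection. A sketch (account ID, role name, and namespace are assumptions):

# Sketch: ServiceAccount wired to an IAM role via IRSA (account ID assumed)
apiVersion: v1
kind: ServiceAccount
metadata:
  name: loki-sa
  namespace: monitoring        # assumed namespace
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::<ACCOUNT_ID>:role/loki-irsa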

Summary#

| Tool | Local storage (PVC) | S3 storage | Write pattern |
| --- | --- | --- | --- |
| Loki | WAL only (minimal) | All log chunks | Ingester flushes directly to S3 |
| Tempo | WAL only (minimal) | All trace blocks | Ingester flushes directly to S3 |
| Thanos | Prometheus PVC (recent data) | TSDB block copies | Sidecar uploads every 2 hours |

Key reasons for choosing S3-backed storage:

  1. Avoids PVC AZ mismatch on Staging Spot instances
  2. Eliminates manual disk expansion as data grows
  3. ~70% cost reduction compared to EBS (with Lifecycle policies)
  4. 11 nines durability for data safety