# Monitoring Data S3 Object Storage Architecture

## Problem
When Spot instances were reclaimed in the Staging environment and new nodes were provisioned in a different AZ, monitoring pods (Loki, Tempo, etc.) got stuck in Pending because they couldn’t mount their existing PVCs.
## Symptoms

- Spot node reclaimed → new node provisioned in a different AZ (e.g., ap-northeast-2a → 2c)
- Monitoring pods rescheduled → existing PVC (EBS) remained bound to the previous AZ, so the mount failed
- Pods stuck in `Pending` with a `FailedAttachVolume` warning → log and trace collection interrupted
## Root cause: EBS is bound to a single AZ

```
[Before]
Node (ap-northeast-2a) ← Spot
└── Loki Pod
    └── PVC → EBS (pinned to ap-northeast-2a)

[After Spot reclaim]
Node (ap-northeast-2c) ← newly provisioned
└── Loki Pod (Pending)
    └── PVC → EBS (ap-northeast-2a) ← cannot be mounted!
```
EBS volumes can only be used in the AZ where they were created. Because Spot instances can be provisioned in any available AZ, AZ mismatches occurred frequently.
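The AZ pinning is visible on the PersistentVolume object itself: the EBS CSI driver records a required node affinity for the volume's zone, so the scheduler can never place the pod in another AZ. A sketch of such a PV (names and volume ID are illustrative, not taken from the cluster):

```yaml
# Sketch of a PV as provisioned by the EBS CSI driver (illustrative values).
# The nodeAffinity block is what pins the volume -- and any pod mounting it --
# to the AZ where the EBS volume was created.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pvc-loki-data            # hypothetical name
spec:
  capacity:
    storage: 100Gi
  accessModes:
    - ReadWriteOnce
  csi:
    driver: ebs.csi.aws.com
    volumeHandle: vol-0123456789abcdef0   # illustrative volume ID
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: topology.ebs.csi.aws.com/zone
              operator: In
              values:
                - ap-northeast-2a          # pods can only run in this AZ
```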
## Additional problems with PVC (EBS)
| Problem | Description |
|---|---|
| AZ lock-in | EBS bound to a single AZ → mount failure on Spot reclaim |
| Cost | gp3 100GB = ~$8/month; disk expansion needed as data grows |
| Capacity limit | Full disk causes Pod CrashLoop, manual expansion required |
| Data durability | Complex recovery on Pod/node failure |
## Fix
Redesigned to use S3 as the persistent storage backend for logs (Loki), traces (Tempo), and long-term metrics (Thanos).
## S3 vs PVC comparison
| Item | PVC (EBS) | S3 |
|---|---|---|
| Cost | gp3 100GB = ~$8/month | 100GB = ~$2.3/month |
| Capacity | Fixed, manual expansion | Unlimited |
| AZ constraint | Single AZ bound | Accessible from any AZ in region |
| Spot compatibility | AZ mismatch risk | No impact |
| Lifecycle | Manual management | Automatic (Glacier transition, expiration) |
| Durability | 99.8–99.9% (gp3) | 99.999999999% (11 nines) |
## Per-tool S3 storage architecture

### 1. Loki (logs)

```
[App Pod] → stdout
  → [Promtail/Alloy] → log collection
  → [Loki Ingester] → buffer chunks in memory (WAL)
  → threshold reached → flush chunks to S3
```
- Ingester accumulates logs in memory, then flushes in bulk to S3
- PVC used minimally (WAL only) or not at all
- Queries read directly from S3 (via the Querier component)
```yaml
# Loki S3 configuration
loki:
  storage:
    type: s3
    s3:
      region: ap-northeast-2
      bucketnames: goormgb-loki-chunks
```
### 2. Tempo (traces)

```
[App Pod] → OTLP (gRPC:4317)
  → [OTel Collector] → trace collection
  → [Tempo Ingester] → buffer blocks in memory (WAL)
  → threshold reached → flush blocks to S3
```
- Nearly the same pattern as Loki
- Trace data stored in S3 in block units
- Queries read from S3
```yaml
# Tempo S3 configuration
tempo:
  storage:
    trace:
      backend: s3
      s3:
        bucket: goormgb-tempo-traces
        region: ap-northeast-2
```
### 3. Thanos (long-term metrics)
Thanos follows a different pattern from Loki/Tempo. Prometheus continues to use a PVC, and the Thanos Sidecar copies TSDB blocks to S3.
```
[Prometheus] → local TSDB (PVC, retains recent data for a few days)
  → [Thanos Sidecar] → uploads TSDB blocks to S3 every 2 hours
```

Query path:

```
[Thanos Query]
  → recent data: query Prometheus directly (via Sidecar)
  → historical data: query S3 (via Thanos Store Gateway)
```
- Prometheus still uses a PVC (for recent data and WAL)
- Thanos Sidecar uploads completed TSDB blocks to S3
- Historical queries are served by Thanos Store Gateway reading from S3
```yaml
# Thanos S3 configuration (Prometheus values)
prometheus:
  prometheusSpec:
    thanos:
      objectStorageConfig:
        secret:
          type: S3
          config:
            bucket: goormgb-thanos-metrics
            region: ap-northeast-2
```
## Overall data flow

```
[Loki Ingester]  ───▶  S3: goormgb-loki-chunks     (log chunks)
[Tempo Ingester] ───▶  S3: goormgb-tempo-traces    (trace blocks)
[Thanos Sidecar] ───▶  S3: goormgb-thanos-metrics  (TSDB blocks, 2h intervals)

All three buckets: Lifecycle 30d → Glacier, 400d → delete
```
## S3 Lifecycle Policy
S3 Lifecycle policies applied for cost optimization:
| Period | Storage class | Cost (per 100GB) |
|---|---|---|
| 0–30 days | S3 Standard | ~$2.3/month |
| 30–400 days | S3 Glacier | ~$0.4/month |
| After 400 days | Deleted | $0 |
Recent data remains fast to query, while older data automatically transitions to Glacier to reduce costs.
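As a sketch, the 30-day Glacier transition and 400-day expiration from the table can be expressed as a single S3 lifecycle rule, shown here in CloudFormation form (the resource and rule names are illustrative, not the actual infrastructure code):

```yaml
# Illustrative CloudFormation sketch of the lifecycle policy described above.
Resources:
  LokiChunksBucket:                       # hypothetical resource name
    Type: AWS::S3::Bucket
    Properties:
      BucketName: goormgb-loki-chunks
      LifecycleConfiguration:
        Rules:
          - Id: glacier-then-expire       # hypothetical rule name
            Status: Enabled
            Transitions:
              - StorageClass: GLACIER
                TransitionInDays: 30      # 0-30 days: S3 Standard
            ExpirationInDays: 400         # deleted after 400 days
```

The same rule is applied to the Tempo and Thanos buckets.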
## S3 Access: IRSA
Pods access S3 using IRSA (IAM Roles for Service Accounts) rather than static AWS access keys.
```
[Loki Pod]
  → ServiceAccount: loki-sa
  → annotation: eks.amazonaws.com/role-arn: arn:aws:iam::role/loki-irsa
  → IAM Role: S3 PutObject/GetObject permissions only
```
- Temporary credentials issued via OIDC, no static keys
- Least-privilege access per pod
- No key rotation required
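The annotation-based wiring above boils down to a ServiceAccount manifest like the following (the namespace and account ID are illustrative placeholders):

```yaml
# Hypothetical ServiceAccount for Loki with an IRSA role annotation.
# Pods using this ServiceAccount obtain temporary S3 credentials via the
# cluster's OIDC provider; no static AWS access keys are mounted.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: loki-sa
  namespace: monitoring            # assumed namespace
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/loki-irsa  # illustrative account ID
```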
## Summary
| Tool | Local storage (PVC) | S3 storage | Write pattern |
|---|---|---|---|
| Loki | WAL only (minimal) | All log chunks | Ingester flushes directly to S3 |
| Tempo | WAL only (minimal) | All trace blocks | Ingester flushes directly to S3 |
| Thanos | Prometheus PVC (recent data) | TSDB block copies | Sidecar uploads every 2 hours |
Key reasons for choosing S3-backed storage:
- Avoids PVC AZ mismatch on Staging Spot instances
- Eliminates manual disk expansion as data grows
- ~70% cost reduction compared to EBS (with Lifecycle policies)
- 11 nines durability for data safety