Route53 Record Deletion + Secrets Manager Reset
2026-04-10 | Route53 deletion and Secret reset triggered by a team member’s terraform apply
Problem
- Route53 records deleted: DNS records disappeared after EKS restart/scale-down
- Secrets Manager reset: DB/Redis passwords and other secrets were overwritten on terraform apply
Root Cause 1: external-dns policy: sync
The issue
policy: sync # allows record deletion
policy: sync deletes Route53 records when the corresponding Kubernetes Ingress/Service is removed.
Bringing down or scaling in EKS → Ingress deleted → external-dns deletes Route53 records.
Flow
EKS shutdown/scale-in
→ Ingress resources deleted
→ external-dns: "need to delete DNS records tied to this Ingress" (sync policy)
→ Route53 records deleted: argocd.playball.one, grafana.playball.one, etc.
→ Services become unreachable
Fix
# Both staging and prod
policy: upsert-only # create/update only, never delete
Modified files
| File | Branch |
|---|---|
| 303-goormgb-k8s-helm/staging/values/infra/values-external-dns.yaml | develop, argocd-sync/staging |
| 303-goormgb-k8s-helm/prod/values/infra/values-external-dns.yaml | develop, argocd-sync/prod |
upsert-only vs sync
| | sync | upsert-only |
|---|---|---|
| Create records | O | O |
| Update records | O | O |
| Delete records | O (dangerous) | X |
| EKS restart | records deleted | records preserved |
Notes
- With upsert-only, stale records may accumulate and need manual cleanup.
- If automatic record deletion is required, use sync with --txt-owner-id for ownership tracking. In environments where EKS is frequently cycled up and down, however, upsert-only is the safer choice.
Root Cause 2: Secrets Manager secret_version overwrite
The issue (Prod)
# prod/secrets.tf (before fix)
resource "aws_secretsmanager_secret_version" "discord" {
  secret_string = jsonencode(var.common_secrets["prod/monitoring/discord-webhook-alerts"])
  # no lifecycle ignore_changes → overwrites with tfvars value on every apply
}
On every terraform apply:
- If var.common_secrets has a value → overwrites with that value
- If terraform.tfvars is missing or the key is absent → overwrites with an empty value
- Any manual changes made in the AWS Console are lost
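The missing-key case above can be turned into a plan-time error rather than a silent overwrite. The following is a minimal sketch, not the repository's actual declaration: it assumes var.common_secrets is a map-like value keyed by secret path, and the description text and error message are illustrative. It only guards against an absent key; it does not stop a present but wrong value from being applied.

```hcl
# Sketch only: the real declaration of this variable is not shown in this post.
variable "common_secrets" {
  description = "Secret payloads keyed by secret path (assumed shape)."
  type        = any

  validation {
    # can() returns false when the index lookup would raise an error,
    # so a missing key stops the run during plan instead of reaching apply.
    condition     = can(var.common_secrets["prod/monitoring/discord-webhook-alerts"])
    error_message = "The common_secrets variable must define the key prod/monitoring/discord-webhook-alerts."
  }
}
```

A check like this complements ignore_changes rather than replacing it: ignore_changes protects values edited in the Console, while the validation catches an incomplete terraform.tfvars before anything is written.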
Why was staging unaffected?
# staging/secrets.tf (already safe)
resource "aws_secretsmanager_secret_version" "kafka" {
  secret_string = jsonencode({...})
  lifecycle { ignore_changes = [secret_string] } # already present
}
Staging already had ignore_changes = [secret_string], so changes after initial creation were ignored.
Prod was missing this.
Fix
# prod/secrets.tf (after fix)
# 1. Prevent accidental secret deletion
resource "aws_secretsmanager_secret" "this" {
  lifecycle {
    prevent_destroy = true # added
    ignore_changes  = [description]
  }
}
# 2. Manually managed secrets: ignore value changes
resource "aws_secretsmanager_secret_version" "discord" {
  secret_string = jsonencode(...)
  lifecycle { ignore_changes = [secret_string] } # added
}
# 3. Infrastructure-linked secrets (RDS/Redis endpoints): do NOT ignore
#    → new endpoint must be reflected when RDS/Redis is recreated
resource "aws_secretsmanager_secret_version" "ai_postgres" {
  secret_string = jsonencode({
    host     = module.rds.address         # dynamic
    password = module.rds.master_password # dynamic
  })
  # no ignore_changes (intentional)
}
Modified files
| File | Change |
|---|---|
| 301-goormgb-terraform/environments/prod/secrets.tf | Added prevent_destroy + ignore_changes |
| 301-goormgb-terraform/environments/prod/main.tf | Added prevent_destroy to redis secret |
| 301-goormgb-terraform/environments/staging/secrets.tf | Added prevent_destroy (ignore_changes was already present) |
Secret protection policy summary
| Secret type | prevent_destroy | ignore_changes | Reason |
|---|---|---|---|
| Manually managed (Discord, OAuth, Mail, etc.) | O | O | Managed via Console; Terraform should not touch it |
| Kafka defaults | O | O | No changes needed after initial setup |
| AI PostgreSQL (host, password) | O | X | Must reflect new values if RDS is recreated |
| AI Redis (host, port) | O | X | Must reflect new endpoint if ElastiCache is recreated |
| Redis (services/redis) | O | X | ElastiCache endpoint is dynamic |
Preventive Measures
- external-dns: Lock the policy to upsert-only. Do not switch to sync.
- Secrets Manager: For any new secret, always include prevent_destroy, plus ignore_changes for manually managed values.
- Before terraform apply: If terraform plan shows a destroy or a secret_string change, stop and review before applying.
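As a starting point for the Secrets Manager rule above, a new manually managed secret could be declared roughly as follows. This is a sketch under assumptions: the resource names, the secret path prod/example/webhook, and the CHANGE_ME seed payload are all hypothetical placeholders, not values from the actual repositories.

```hcl
# Hypothetical new secret; names and path are placeholders for illustration.
resource "aws_secretsmanager_secret" "example_webhook" {
  name = "prod/example/webhook"

  lifecycle {
    # Any plan that would destroy this secret fails with an error.
    prevent_destroy = true
  }
}

resource "aws_secretsmanager_secret_version" "example_webhook" {
  secret_id = aws_secretsmanager_secret.example_webhook.id

  # Seed value only; the real value is expected to be set in the AWS Console.
  secret_string = jsonencode({ url = "CHANGE_ME" })

  lifecycle {
    # Later applies leave the Console-managed value untouched.
    ignore_changes = [secret_string]
  }
}
```

Two caveats apply to this pattern. prevent_destroy only takes effect while the resource block remains in the configuration, so deleting the block itself still plans a destroy. Lifecycle arguments also accept only literal values, which is why manually managed and infrastructure-linked secrets need separate resource blocks rather than one block with a conditional.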