2026-04-10 | Route53 deletion and Secret reset triggered by a team member’s terraform apply


Problem

  1. Route53 records deleted: DNS records disappeared after EKS restart/scale-down
  2. Secrets Manager reset: DB/Redis passwords and other secrets were overwritten on terraform apply

Root Cause 1: external-dns policy: sync

The issue

policy: sync  # allows record deletion

With policy: sync, external-dns deletes a Route53 record as soon as the Kubernetes Ingress or Service that produced it is removed. Shutting down or scaling in EKS deletes the Ingress resources, so external-dns then deletes the matching Route53 records.

Flow

EKS shutdown/scale-in
  → Ingress resources deleted
  → external-dns: "need to delete DNS records tied to this Ingress" (sync policy)
  → Route53 records deleted: argocd.playball.one, grafana.playball.one, etc.
  → Services become unreachable

Fix

# Both staging and prod
policy: upsert-only  # create/update only, never delete
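
To keep this setting from drifting back to sync, a small check could run in CI against the values files. This is a minimal sketch and not part of the original fix. The function names are hypothetical, and it assumes policy is a top-level key in values-external-dns.yaml, as in the snippet above.

```python
import re

# Assumes `policy` is a top-level (unindented) key, as in the snippet above.
POLICY_RE = re.compile(r"""^policy:\s*["']?([\w-]+)""", re.MULTILINE)


def find_policy(values_text):
    """Return the external-dns policy named in a values file, or None if absent."""
    match = POLICY_RE.search(values_text)
    return match.group(1) if match else None


def is_safe(values_text):
    """True only when the policy is explicitly set to upsert-only."""
    return find_policy(values_text) == "upsert-only"


print(is_safe("policy: upsert-only  # create/update only, never delete\n"))
print(is_safe("policy: sync  # allows record deletion\n"))
```

A file with no policy key at all is treated as unsafe, so the check does not depend on whatever default the chart applies.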

Modified files

| File | Branch |
|---|---|
| 303-goormgb-k8s-helm/staging/values/infra/values-external-dns.yaml | develop, argocd-sync/staging |
| 303-goormgb-k8s-helm/prod/values/infra/values-external-dns.yaml | develop, argocd-sync/prod |

upsert-only vs sync

| | sync | upsert-only |
|---|---|---|
| Create records | Yes | Yes |
| Update records | Yes | Yes |
| Delete records | Yes (dangerous) | No |
| EKS restart | records deleted | records preserved |
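
The difference between the two policies can be illustrated with a toy model. This is only a sketch of the behavior described above, not external-dns's actual code. The Plan type, the plan_changes function, and the record targets are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    create: dict
    update: dict
    delete: list


def plan_changes(desired, current, policy):
    """Toy model: diff desired records (from Ingress/Service) against current DNS records."""
    create = {name: target for name, target in desired.items() if name not in current}
    update = {
        name: target
        for name, target in desired.items()
        if name in current and current[name] != target
    }
    # Only "sync" may remove records that no longer have a source resource.
    delete = sorted(n for n in current if n not in desired) if policy == "sync" else []
    return Plan(create=create, update=update, delete=delete)


# Records present in Route53 before the cluster goes down (targets are made up).
current = {
    "argocd.playball.one": "alb-1.example.com",
    "grafana.playball.one": "alb-1.example.com",
}
# After EKS shutdown the Ingress resources are gone, so nothing is desired.
desired = {}

print(plan_changes(desired, current, "sync").delete)
print(plan_changes(desired, current, "upsert-only").delete)
```

With sync, both records end up in the delete list. With upsert-only, the delete list stays empty and the records survive until the Ingress resources come back.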

Notes

  • With upsert-only, stale records may accumulate and need manual cleanup
  • If automatic record deletion is required, use sync together with --txt-owner-id so that external-dns only deletes records it owns. In environments where EKS is frequently cycled up and down, upsert-only remains the safer choice

Root Cause 2: Secrets Manager secret_version overwrite

The issue (Prod)

# prod/secrets.tf (before fix)
resource "aws_secretsmanager_secret_version" "discord" {
  secret_string = jsonencode(var.common_secrets["prod/monitoring/discord-webhook-alerts"])
  # no lifecycle ignore_changes → overwrites with tfvars value on every apply
}

On every terraform apply:

  1. If var.common_secrets has a value → overwrites with that value
  2. If terraform.tfvars is missing or the key is absent → overwrites with an empty value
  3. Any manual changes made in the AWS Console are lost

Why was staging unaffected?

# staging/secrets.tf (already safe)
resource "aws_secretsmanager_secret_version" "kafka" {
  secret_string = jsonencode({...})
  lifecycle { ignore_changes = [secret_string] }  # already present
}

Staging already had ignore_changes = [secret_string], so changes after initial creation were ignored. Prod was missing this.
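
The effect of ignore_changes on an existing secret can be reduced to a very small model. This is a sketch of the behavior described above, not Terraform's implementation. The function name and the example values are hypothetical, and the model only covers applies after the secret version has been created.

```python
def value_after_apply(config_value, remote_value, ignore_changes):
    """Return the secret value stored remotely after one `terraform apply`.

    Toy model: Terraform compares the configuration with the refreshed
    remote value. With ignore_changes on secret_string that difference is
    dropped, so the remote value survives. Without it, the configuration wins.
    """
    if ignore_changes:
        return remote_value
    return config_value


tfvars_value = '{"url": "from-tfvars"}'           # hypothetical value rendered from tfvars
console_value = '{"url": "edited-in-console"}'    # hypothetical manual edit in the Console

# prod before the fix: the Console edit is overwritten
print(value_after_apply(tfvars_value, console_value, ignore_changes=False))
# staging: the Console edit survives
print(value_after_apply(tfvars_value, console_value, ignore_changes=True))
```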

Fix

# prod/secrets.tf (after fix)

# 1. Prevent accidental secret deletion
resource "aws_secretsmanager_secret" "this" {
  lifecycle {
    prevent_destroy = true   # added
    ignore_changes  = [description]
  }
}

# 2. Manually managed secrets: ignore value changes
resource "aws_secretsmanager_secret_version" "discord" {
  secret_string = jsonencode(...)
  lifecycle { ignore_changes = [secret_string] }  # added
}

# 3. Infrastructure-linked secrets (RDS/Redis endpoints): do NOT ignore
#    → new endpoint must be reflected when RDS/Redis is recreated
resource "aws_secretsmanager_secret_version" "ai_postgres" {
  secret_string = jsonencode({
    host     = module.rds.address          # dynamic
    password = module.rds.master_password  # dynamic
  })
  # no ignore_changes (intentional)
}

Modified files

| File | Change |
|---|---|
| 301-goormgb-terraform/environments/prod/secrets.tf | Added prevent_destroy + ignore_changes |
| 301-goormgb-terraform/environments/prod/main.tf | Added prevent_destroy to redis secret |
| 301-goormgb-terraform/environments/staging/secrets.tf | Added prevent_destroy (ignore_changes was already present) |

Secret protection policy summary

| Secret type | prevent_destroy | ignore_changes | Reason |
|---|---|---|---|
| Manually managed (Discord, OAuth, Mail, etc.) | Yes | Yes | Managed via Console; Terraform should not touch it |
| Kafka defaults | Yes | Yes | No changes needed after initial setup |
| AI PostgreSQL (host, password) | Yes | No | Must reflect new values if RDS is recreated |
| AI Redis (host, port) | Yes | No | Must reflect new endpoint if ElastiCache is recreated |
| Redis (services/redis) | Yes | No | ElastiCache endpoint is dynamic |

Preventive Measures

  1. external-dns: Keep the policy locked to upsert-only. Do not switch back to sync.
  2. Secrets Manager: For any new secret, always add prevent_destroy. For manually managed values, also add ignore_changes = [secret_string].
  3. Before terraform apply: If terraform plan shows a destroy or a secret_string change, stop and review before applying.
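
The plan review in step 3 could be partly automated. The sketch below reads the JSON form of a saved plan and lists the changes that deserve a manual look. It is an illustration under assumptions, not the team's actual tooling: the script and function names are hypothetical. It relies on the resource_changes list and change.actions field of Terraform's JSON plan output. Depending on the provider version, a secret_string change may appear as an update or as a replacement, so both are flagged.

```python
import json
import sys

SECRET_VERSION_TYPE = "aws_secretsmanager_secret_version"


def risky_changes(plan):
    """Return (address, reason) pairs for planned changes that need review."""
    findings = []
    for rc in plan.get("resource_changes", []):
        actions = rc.get("change", {}).get("actions", [])
        if "delete" in actions:
            # Covers plain destroys and replacements (delete + create).
            findings.append((rc["address"], "destroy"))
        elif rc.get("type") == SECRET_VERSION_TYPE and "update" in actions:
            findings.append((rc["address"], "secret value change"))
    return findings


def main(path):
    with open(path) as f:
        plan = json.load(f)
    findings = risky_changes(plan)
    for address, reason in findings:
        print(f"REVIEW {address}: {reason}")
    # Non-zero exit code lets a CI job stop before apply.
    return 1 if findings else 0


if __name__ == "__main__" and len(sys.argv) == 2:
    sys.exit(main(sys.argv[1]))
```

One possible way to run it, assuming the script is saved as plan_guard.py: run terraform plan -out=plan.out, then terraform show -json plan.out > plan.json, then python plan_guard.py plan.json.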