Problem#

Two major issues encountered while building an EKS 1.34 cluster, along with their resolutions.


1. Node Group Creation Failure - kubelet Label Issue#

Symptoms#

  • Node Group status: CREATE_FAILED
  • Error message: NodeCreationFailure - Unhealthy nodes in the kubernetes cluster
  • EC2 instances are running but not joining the cluster

Root Cause Analysis#

Connected to the node via SSM and checked kubelet logs:

sudo journalctl -u kubelet -n 50

kubelet error:

unknown 'kubernetes.io' or 'k8s.io' labels specified with --node-labels

Root Cause#

The Terraform EKS module was using node-role.kubernetes.io/infra as a node group label:

labels = {
  "node-role.kubernetes.io/infra" = "true"  # <- this is the problem!
  "role" = "infra"
}

Why is this a problem?

  • kubelet forbids setting labels in the kubernetes.io or k8s.io namespaces during node registration
  • This is a security measure — these namespaces are reserved and managed by the Kubernetes system
  • Allowing nodes to self-assign labels in these namespaces could lead to privilege escalation vulnerabilities
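
If a node-role.kubernetes.io/* label is genuinely needed, it can still be applied after the node registers, since the restriction only covers the kubelet labeling itself. A minimal sketch (the node name is a placeholder):

# Hypothetical: apply the role label post-registration with admin credentials
kubectl label node <node-name> node-role.kubernetes.io/infra=true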

Fix#

Remove node-role.kubernetes.io/* labels:

# Fixed code
labels = {
  "arch"          = "amd64"
  "role"          = "infra"      # use plain labels instead
  "capacity-type" = "on-demand"
  "workload"      = "infra"
  "owner"         = var.owner_name
}
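
Once the node group is recreated, the labels can be verified on the joined nodes (label keys taken from the block above):

kubectl get nodes -L role,workload,capacity-type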

Affected Files#

  • modules/eks/main.tf - monitoring, infra, apps node group labels
  • staging/charts/karpenter/templates/nodepool.yaml - Karpenter NodePool labels

2. Node NotReady - Missing vpc-cni Addon#

Symptoms#

  • All nodes in NotReady state
  • kubectl describe node shows:
    NetworkNotReady: network is not ready: container runtime network not ready
    NetworkPluginNotReady: cni plugin not initialized
    

Root Cause Analysis#

Check kube-system pods:

kubectl get pods -n kube-system

aws-node (vpc-cni) pods are missing!

Check addon list:

aws eks list-addons --cluster-name goormgb-staging-eks --profile wonny

Result:

{
    "addons": ["aws-ebs-csi-driver"]  // vpc-cni, kube-proxy, coredns missing!
}

Root Cause#

During terraform apply, the addons defined in the cluster_addons block were not created.

cluster_addons = {
  vpc-cni = { most_recent = true }
  kube-proxy = { most_recent = true }
  coredns = { most_recent = true }
}

Checking terraform state:

terraform state list | grep addon
  • data.aws_eks_addon_version.this["vpc-cni"] appears - but this is only the version lookup data source
  • aws_eks_addon.this["vpc-cni"] is absent - the actual addon resource was never created

Fix#

Destroy and re-apply EKS:

terraform destroy -target=module.eks
terraform apply -target=module.eks
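
If tearing down the whole cluster is too disruptive, a narrower apply may suffice. This is a sketch assuming the terraform-aws-modules/eks internal resource addresses; confirm them against terraform state list first:

# Target only the missing addon resources (addresses assume the
# terraform-aws-modules/eks module layout)
terraform apply \
  -target='module.eks.aws_eks_addon.this["vpc-cni"]' \
  -target='module.eks.aws_eks_addon.this["kube-proxy"]' \
  -target='module.eks.aws_eks_addon.this["coredns"]'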

Required EKS Addons#

Addon            Role
vpc-cni          Pod networking (ENI allocation) - nodes are NotReady without it
kube-proxy       Service networking (iptables/ipvs)
coredns          DNS service
ebs-csi-driver   EBS volume provisioning

Additional Configuration - Addon Tolerations#

For addons to run on tainted nodes, tolerations are required:

cluster_addons = {
  vpc-cni = {
    most_recent = true
    configuration_values = jsonencode({
      tolerations = [{
        key      = "role"
        operator = "Exists"
        effect   = "NoSchedule"
      }]
    })
  }
  coredns = {
    most_recent = true
    configuration_values = jsonencode({
      tolerations = [{
        key      = "role"
        operator = "Equal"
        value    = "infra"
        effect   = "NoSchedule"
      }]
      nodeSelector = {
        role = "infra"
      }
    })
  }
}
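
After applying, it is worth confirming that the addon Pods actually land on the tainted infra nodes:

# coredns Pods carry the k8s-app=kube-dns label
kubectl get pods -n kube-system -l k8s-app=kube-dns -o wide
kubectl describe pod -n kube-system -l k8s-app=kube-dns | grep -A 3 Tolerations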

Lessons Learned#

  1. No kubernetes.io namespace labels: kubelet rejects them. Use plain labels like role, workload, etc.

  2. Always verify EKS addons: Use aws eks list-addons to confirm required addons are present.

  3. Use SSM for debugging: When nodes have issues, connect via SSM and check journalctl -u kubelet.

  4. Verify terraform state: What shows in plan may differ from what actually gets created.


Reference Commands#

# Check node status
kubectl get nodes -w
kubectl describe node <node-name>

# Check addons
aws eks list-addons --cluster-name <cluster> --profile <profile>
aws eks describe-addon --cluster-name <cluster> --addon-name vpc-cni

# SSM access
aws ssm start-session --target <instance-id> --profile <profile>

# kubelet logs
sudo journalctl -u kubelet -n 100

# EC2 console output (boot logs)
aws ec2 get-console-output --instance-id <id> --latest --output text

Complete EKS Addons Guide#

What Are Addons?#

EKS Addons are software components that provide core operational capabilities for a Kubernetes cluster. They are managed by AWS, with version compatibility validation and security patches applied automatically.

Benefits of using addons:

  • AWS-managed: automated version updates and security patches
  • Guaranteed EKS version compatibility
  • IAM Roles for Service Accounts (IRSA) integration
  • Customizable configuration values

Required Addons (Mandatory for cluster operation)#

1. Amazon VPC CNI (vpc-cni)#

Role: Handles Pod networking. Assigns real VPC IP addresses to each Pod.

How it works:

  • Attaches ENI (Elastic Network Interface) to nodes
  • Assigns secondary IPs from the ENI to Pods
  • Allows Pods to communicate directly with other VPC resources

Without it:

  • Nodes remain in NotReady state
  • Pods cannot receive IPs and cannot be scheduled
  • Container network initialization fails

Deployed as DaemonSet: aws-node Pod runs on every node

# Key configuration
configuration_values:
  env:
    ENABLE_PREFIX_DELEGATION: "true"  # increase Pod density
    WARM_IP_TARGET: "5"               # reserved IP count
  tolerations:
    - operator: Exists                # run on all nodes
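
A quick health check after changing these values (prefix delegation should raise each node's allocatable Pod count):

kubectl get daemonset aws-node -n kube-system
kubectl get nodes -o custom-columns='NAME:.metadata.name,PODS:.status.allocatable.pods'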

2. CoreDNS (coredns)#

Role: DNS service inside the cluster. Resolves Service names to IPs.

How it works:

  • Exposed as the kube-dns service (default ClusterIP: 10.100.0.10)
  • Pod’s /etc/resolv.conf references this DNS server
  • Resolves Service names → ClusterIP
  • Forwards external domains to upstream DNS
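
This path can be exercised with a throwaway Pod (the busybox image is just an illustration):

kubectl run dns-test --rm -it --restart=Never --image=busybox:1.36 -- \
  nslookup kubernetes.default.svc.cluster.local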

Without it:

  • Service discovery like curl http://my-service fails
  • Must use IPs directly for Pod-to-Pod communication
  • External domain resolution may also fail (depending on config)

Deployed as Deployment: typically 2 replicas (HA)

# Key configuration
configuration_values:
  replicaCount: 2
  tolerations:
    - key: "role"
      value: "infra"
      effect: "NoSchedule"
  nodeSelector:
    role: infra

3. kube-proxy (kube-proxy)#

Role: Handles Service networking. Routes Service ClusterIP traffic to Pod IPs.

How it works:

  • Manages iptables/ipvs rules on each node
  • Distributes traffic arriving at a Service’s ClusterIP to backend Pods
  • Also handles NodePort and LoadBalancer type Services
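
The rules can be inspected directly on a node (e.g. via the SSM session shown earlier):

sudo iptables -t nat -L KUBE-SERVICES -n | head -20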

Without it:

  • ClusterIP Services become inaccessible
  • Service → Pod load balancing stops working
  • Some network policies may not function

Deployed as DaemonSet: runs on every node

# Key configuration
configuration_values:
  mode: "iptables"  # or "ipvs"
  tolerations:
    - operator: Exists

4. Amazon EBS CSI Driver (aws-ebs-csi-driver)#

Role: Dynamic provisioning of EBS volumes. Automatically creates EBS volumes from PersistentVolumeClaims.

How it works:

  • Implements the CSI (Container Storage Interface) standard
  • StorageClass-based volume provisioning
  • Supports EBS types: gp3, io1, io2, etc.
  • Supports snapshots and resizing
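
A minimal sketch of the provisioning flow, with illustrative resource names (ebs.csi.aws.com is the driver's provisioner):

# Illustrative gp3 StorageClass plus a PVC the driver provisions on demand
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
volumeBindingMode: WaitForFirstConsumer
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: gp3
  resources:
    requests:
      storage: 10Gi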

Without it:

  • PVC creation fails to provision a volume (stays Pending)
  • StatefulSets that rely on EBS-backed PVCs cannot start
  • EBS-backed data persistence is unavailable

Components:

  • Controller: Deployment (2 replicas)
  • Node: DaemonSet (one per node)

# Key configuration
configuration_values:
  controller:
    tolerations:
      - key: "role"
        value: "infra"
        effect: "NoSchedule"
    nodeSelector:
      role: infra
  node:
    tolerations:
      - operator: Exists  # run on all nodes

5. Amazon EFS CSI Driver (aws-efs-csi-driver)#

Role: Mount EFS filesystems. Supports simultaneous read/write from multiple Pods.

Use cases:

  • Shared file storage (ReadWriteMany)
  • File sharing between containers
  • Serving static content
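
A ReadWriteMany claim needs an EFS-backed StorageClass; a sketch using dynamic access-point provisioning (the filesystem ID is a placeholder):

# fileSystemId is a placeholder for a real EFS filesystem
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: efs-sc
provisioner: efs.csi.aws.com
parameters:
  provisioningMode: efs-ap
  fileSystemId: fs-xxxxxxxx
  directoryPerms: "700"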

Without it:

  • Cannot mount EFS
  • Cannot use ReadWriteMany volumes

6. AWS Load Balancer Controller (installed separately)#

Role: Automatically provisions ALB/NLB. Handles Ingress and Service (type: LoadBalancer).

Note: Not an EKS Addon — must be installed separately via Helm Chart!

# Install via Helm
helm install aws-load-balancer-controller eks/aws-load-balancer-controller \
  -n kube-system \
  --set clusterName=<cluster-name>
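
Once the controller runs, an Ingress like the following is reconciled into an ALB (resource names are illustrative; the annotations are the controller's standard ones):

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: web
  annotations:
    alb.ingress.kubernetes.io/scheme: internet-facing
    alb.ingress.kubernetes.io/target-type: ip
spec:
  ingressClassName: alb
  rules:
    - http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: web
                port:
                  number: 80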

Optional Addons#

7. Amazon CloudWatch Observability (amazon-cloudwatch-observability)#

Role: Send container metrics and logs to CloudWatch.

Components:

  • CloudWatch Agent: metrics collection
  • Fluent Bit: log collection
  • ADOT Collector: tracing (optional)

8. Amazon GuardDuty Agent (aws-guardduty-agent)#

Role: Runtime threat detection. Monitors containers for anomalous behavior.


9. AWS Distro for OpenTelemetry (adot)#

Role: Distributed tracing and metrics collection. Sends to X-Ray, CloudWatch, Prometheus, etc.


10. Karpenter (installed separately)#

Role: Node autoscaling. Detects Pending Pods → provisions optimal instances.

Not an EKS Addon! Install via Helm separately:

helm install karpenter oci://public.ecr.aws/karpenter/karpenter \
  --namespace karpenter
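
Provisioning is then driven by NodePool resources. A minimal sketch on the karpenter.sh/v1 API (names are illustrative, and a matching EC2NodeClass is assumed to exist):

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: default
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default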

Addon Management Commands#

# List available addons
aws eks describe-addon-versions --kubernetes-version 1.34 \
  --query 'addons[].{name:addonName,versions:addonVersions[0].addonVersion}' \
  --output table

# List installed addons
aws eks list-addons --cluster-name <cluster>

# Addon details
aws eks describe-addon --cluster-name <cluster> --addon-name vpc-cni

# Install addon
aws eks create-addon --cluster-name <cluster> --addon-name vpc-cni

# Update addon
aws eks update-addon --cluster-name <cluster> --addon-name vpc-cni \
  --addon-version v1.21.1-eksbuild.5

# Delete addon
aws eks delete-addon --cluster-name <cluster> --addon-name vpc-cni

Addon Configuration in Terraform#

module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  version = "~> 20.0"

  cluster_addons = {
    # Required
    vpc-cni = {
      most_recent = true
      configuration_values = jsonencode({
        tolerations = [{ operator = "Exists" }]
      })
    }
    kube-proxy = {
      most_recent = true
    }
    coredns = {
      most_recent = true
      configuration_values = jsonencode({
        tolerations = [{
          key    = "role"
          value  = "infra"
          effect = "NoSchedule"
        }]
      })
    }

    # Recommended
    aws-ebs-csi-driver = {
      most_recent              = true
      service_account_role_arn = module.ebs_csi_irsa.iam_role_arn
    }
  }
}

Checking Addon Version Compatibility#

# List supported addon versions for a specific K8s version
aws eks describe-addon-versions \
  --addon-name vpc-cni \
  --kubernetes-version 1.34 \
  --query 'addons[].addonVersions[].addonVersion'

Notes:

  • Check addon compatibility before upgrading EKS version
  • most_recent = true automatically selects the latest compatible version
  • Pin versions in production environments
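
Pinning in the Terraform module looks like the following (the version string is a placeholder; pick one from the command above):

cluster_addons = {
  vpc-cni = {
    addon_version = "<version-from-describe-addon-versions>"  # instead of most_recent = true
  }
}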

Addon Troubleshooting#

When an addon is in Degraded state:

# Check status
aws eks describe-addon --cluster-name <cluster> --addon-name vpc-cni \
  --query 'addon.health'

# Check Pod status
kubectl get pods -n kube-system -l k8s-app=aws-node
kubectl describe pod -n kube-system <aws-node-pod>

# Check events
kubectl get events -n kube-system --sort-by='.lastTimestamp'

Common issues:

  1. Insufficient IRSA role permissions → check IAM policies
  2. Missing toleration → addon cannot run if all nodes have taints
  3. Insufficient resources → adjust requests/limits
  4. Image pull failure → check ECR permissions or network
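
For issue 2, node taints and the addon Pod's tolerations can be compared directly:

# Compare what the nodes demand with what the Pod tolerates
kubectl get nodes -o custom-columns='NAME:.metadata.name,TAINTS:.spec.taints[*].key'
kubectl get pod -n kube-system <addon-pod> -o jsonpath='{.spec.tolerations}'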