EKS Troubleshooting - Node Group Creation Failure and Missing vpc-cni Addon
Problem#
Two major issues encountered while building an EKS 1.34 cluster, along with their resolutions.
1. Node Group Creation Failure - kubelet Label Issue#
Symptoms#
- Node Group status: `CREATE_FAILED`
- Error message: `NodeCreationFailure` - Unhealthy nodes in the kubernetes cluster
- EC2 instances are running but not joining the cluster
Root Cause Analysis#
Connected to the node via SSM and checked kubelet logs:
sudo journalctl -u kubelet -n 50
kubelet error:
unknown 'kubernetes.io' or 'k8s.io' labels specified with --node-labels
Root Cause#
The Terraform EKS module was using node-role.kubernetes.io/infra as a node group label:
labels = {
"node-role.kubernetes.io/infra" = "true" # <- this is the problem!
"role" = "infra"
}
Why is this a problem?
- kubelet forbids self-assigning most labels in the `kubernetes.io` or `k8s.io` namespaces via `--node-labels` during node registration (only a small allowlist such as `kubernetes.io/hostname` is permitted)
- This is a security measure: these namespaces are reserved and managed by the Kubernetes system
- Allowing nodes to self-assign labels in these namespaces could lead to privilege escalation vulnerabilities
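If the `node-role.kubernetes.io/infra` label is genuinely wanted, it can still be applied after the node joins, because the restriction only covers kubelet self-labeling, not the API server. A sketch (the node name below is a placeholder, not from this cluster):

```shell
# Allowed: an administrator sets the role label through the API
# after registration, instead of via kubelet's --node-labels flag.
kubectl label node ip-10-0-1-23.ec2.internal node-role.kubernetes.io/infra=true

# The role should now appear in the ROLES column
kubectl get nodes
```

This is a manual step per node, which is why plain labels (`role=infra`) are the more practical choice for node groups.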
Fix#
Remove node-role.kubernetes.io/* labels:
# Fixed code
labels = {
"arch" = "amd64"
"role" = "infra" # use plain labels instead
"capacity-type" = "on-demand"
"workload" = "infra"
"owner" = var.owner_name
}
Affected Files#
- modules/eks/main.tf - monitoring, infra, apps node group labels
- staging/charts/karpenter/templates/nodepool.yaml - Karpenter NodePool labels
2. Node NotReady - Missing vpc-cni Addon#
Symptoms#
- All nodes in `NotReady` state
- `kubectl describe node` shows:
  NetworkNotReady: network is not ready: container runtime network not ready
  NetworkPluginNotReady: cni plugin not initialized
Root Cause Analysis#
Check kube-system pods:
kubectl get pods -n kube-system
aws-node (vpc-cni) pods are missing!
Check addon list:
aws eks list-addons --cluster-name goormgb-staging-eks --profile wonny
Result:
{
"addons": ["aws-ebs-csi-driver"] // vpc-cni, kube-proxy, coredns missing!
}
Root Cause#
During terraform apply, the addons defined in the cluster_addons block were not created.
cluster_addons = {
vpc-cni = { most_recent = true }
kube-proxy = { most_recent = true }
coredns = { most_recent = true }
}
Checking terraform state:
terraform state list | grep addon
- data.aws_eks_addon_version.this["vpc-cni"] - version data only
- aws_eks_addon.this["vpc-cni"] - actual resource missing from state
Fix#
Destroy and re-apply EKS:
terraform destroy -target=module.eks
terraform apply -target=module.eks
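Destroying the whole module is the heavy-handed fix; when only the addon resources are missing from state, a targeted apply of just those resources may be enough. A sketch - the resource address is an assumption based on the terraform-aws-modules/eks layout, so confirm it with `terraform state list` first:

```shell
# Plan and apply only the missing addon resource instead of
# recreating the cluster. Verify the address before applying.
terraform plan  -target='module.eks.aws_eks_addon.this["vpc-cni"]'
terraform apply -target='module.eks.aws_eks_addon.this["vpc-cni"]'
```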
Required EKS Addons#
| Addon | Role |
|---|---|
| vpc-cni | Pod networking (ENI allocation) - nodes are NotReady without it |
| kube-proxy | Service networking (iptables/ipvs) |
| coredns | DNS service |
| ebs-csi-driver | EBS volume provisioning |
Additional Configuration - Addon Tolerations#
For addons to run on tainted nodes, tolerations are required:
cluster_addons = {
vpc-cni = {
most_recent = true
configuration_values = jsonencode({
tolerations = [{
key = "role"
operator = "Exists"
effect = "NoSchedule"
}]
})
}
coredns = {
most_recent = true
configuration_values = jsonencode({
tolerations = [{
key = "role"
operator = "Equal"
value = "infra"
effect = "NoSchedule"
}]
nodeSelector = {
role = "infra"
}
})
}
}
Lessons Learned#
- No kubernetes.io namespace labels: kubelet rejects them. Use plain labels like `role`, `workload`, etc.
- Always verify EKS addons: Use `aws eks list-addons` to confirm required addons are present.
- Use SSM for debugging: When nodes have issues, connect via SSM and check `journalctl -u kubelet`.
- Verify terraform state: What shows in `plan` may differ from what actually gets created.
Reference Commands#
# Check node status
kubectl get nodes -w
kubectl describe node <node-name>
# Check addons
aws eks list-addons --cluster-name <cluster> --profile <profile>
aws eks describe-addon --cluster-name <cluster> --addon-name vpc-cni
# SSM access
aws ssm start-session --target <instance-id> --profile <profile>
# kubelet logs
sudo journalctl -u kubelet -n 100
# EC2 console output (boot logs)
aws ec2 get-console-output --instance-id <id> --latest --output text
Complete EKS Addons Guide#
What Are Addons?#
EKS Addons are software components that provide core operational capabilities for a Kubernetes cluster. They are managed by AWS, with version compatibility validation and security patches applied automatically.
Benefits of using addons:
- AWS-managed: automated version updates and security patches
- Guaranteed EKS version compatibility
- IAM Role for Service Account (IRSA) integration
- Customizable configuration values
Required Addons (Mandatory for cluster operation)#
1. Amazon VPC CNI (vpc-cni)#
Role: Handles Pod networking. Assigns real VPC IP addresses to each Pod.
How it works:
- Attaches ENI (Elastic Network Interface) to nodes
- Assigns secondary IPs from the ENI to Pods
- Allows Pods to communicate directly with other VPC resources
Without it:
- Nodes remain in `NotReady` state
- Pods cannot receive IPs and cannot be scheduled
- Container network initialization fails
Deployed as DaemonSet: aws-node Pod runs on every node
# Key configuration
configuration_values:
env:
ENABLE_PREFIX_DELEGATION: "true" # increase Pod density
WARM_IP_TARGET: "5" # reserved IP count
tolerations:
- operator: Exists # run on all nodes
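Because every Pod consumes a VPC IP, the instance type's ENI limits cap Pod density. Without prefix delegation, the usual formula is maxPods = ENIs × (IPv4 addresses per ENI − 1) + 2. A quick sketch for t3.medium, whose ENI limits (3 ENIs, 6 IPv4 addresses each) come from the EC2 documentation:

```shell
# max pods = ENIs * (IPs per ENI - 1) + 2
# (one IP per ENI is the primary and not usable by Pods;
#  +2 covers the host-network aws-node and kube-proxy Pods)
enis=3          # ENIs attachable to a t3.medium
ips_per_eni=6   # IPv4 addresses per ENI on t3.medium
echo $(( enis * (ips_per_eni - 1) + 2 ))   # prints 17
```

ENABLE_PREFIX_DELEGATION raises this ceiling substantially by assigning /28 prefixes instead of individual IPs.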
2. CoreDNS (coredns)#
Role: DNS service inside the cluster. Resolves Service names to IPs.
How it works:
- Exposed as the `kube-dns` Service (default ClusterIP: 10.100.0.10)
- Each Pod's /etc/resolv.conf references this DNS server
- Resolves Service names → ClusterIP
- Forwards external domains to upstream DNS
Without it:
- Service discovery like `curl http://my-service` fails
- Must use IPs directly for Pod-to-Pod communication
- External domain resolution may also fail (depending on config)
Deployed as Deployment: typically 2 replicas (HA)
# Key configuration
configuration_values:
replicaCount: 2
tolerations:
- key: "role"
value: "infra"
effect: "NoSchedule"
nodeSelector:
role: infra
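A quick way to confirm CoreDNS is healthy is to resolve a Service name from a throwaway Pod. A sketch - the busybox image is an assumption; any image with DNS tools works:

```shell
# One-shot Pod that resolves the kubernetes.default Service.
# On an EKS cluster with the default 10.100.0.0/16 service CIDR,
# the answer should be the API server's ClusterIP (10.100.0.1).
kubectl run dns-test --rm -it --restart=Never \
  --image=busybox:1.36 -- nslookup kubernetes.default.svc.cluster.local
```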
3. kube-proxy (kube-proxy)#
Role: Handles Service networking. Routes Service ClusterIP traffic to Pod IPs.
How it works:
- Manages iptables/ipvs rules on each node
- Distributes traffic arriving at a Service’s ClusterIP to backend Pods
- Also handles NodePort and LoadBalancer type Services
Without it:
- ClusterIP Services become inaccessible
- Service → Pod load balancing stops working
- Some network policies may not function
Deployed as DaemonSet: runs on every node
# Key configuration
configuration_values:
mode: "iptables" # or "ipvs"
tolerations:
- operator: Exists
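To verify kube-proxy has actually programmed a node, inspect the nat table over an SSM session. A sketch, assuming iptables mode:

```shell
# On the node (via SSM): every Service should have an entry in the
# KUBE-SERVICES chain pointing at a KUBE-SVC-* chain of backends.
sudo iptables -t nat -L KUBE-SERVICES -n | head -20

# kube-proxy also exposes a health endpoint on the node (default port 10256)
curl -s http://localhost:10256/healthz
```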
Recommended Addons (Needed for production)#
4. Amazon EBS CSI Driver (aws-ebs-csi-driver)#
Role: Dynamic provisioning of EBS volumes. Automatically creates EBS volumes from PersistentVolumeClaims.
How it works:
- Implements the CSI (Container Storage Interface) standard
- StorageClass-based volume provisioning
- Supports EBS types: gp3, io1, io2, etc.
- Supports snapshots and resizing
Without it:
- PVC creation fails to provision a volume (stays Pending)
- StatefulSets cannot be used
- Data persistence is not achievable
Components:
- Controller: Deployment (2 replicas)
- Node: DaemonSet (one per node)
# Key configuration
configuration_values:
controller:
tolerations:
- key: "role"
value: "infra"
effect: "NoSchedule"
nodeSelector:
role: infra
node:
tolerations:
- operator: Exists # run on all nodes
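A minimal end-to-end check of dynamic provisioning: create a small PVC and confirm it binds. A sketch - the `gp2` StorageClass name is the EKS default and an assumption for this cluster; with WaitForFirstConsumer binding it stays Pending until a Pod mounts it, which is normal:

```shell
# Create a 1Gi PVC against the default StorageClass
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ebs-test
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: gp2
  resources:
    requests:
      storage: 1Gi
EOF

kubectl get pvc ebs-test
```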
5. Amazon EFS CSI Driver (aws-efs-csi-driver)#
Role: Mount EFS filesystems. Supports simultaneous read/write from multiple Pods.
Use cases:
- Shared file storage (ReadWriteMany)
- File sharing between containers
- Serving static content
Without it:
- Cannot mount EFS
- Cannot use ReadWriteMany volumes
6. AWS Load Balancer Controller (installed separately)#
Role: Automatically provisions ALB/NLB. Handles Ingress and Service (type: LoadBalancer).
Note: Not an EKS Addon — must be installed separately via Helm Chart!
# Install via Helm
helm install aws-load-balancer-controller eks/aws-load-balancer-controller \
-n kube-system \
--set clusterName=<cluster-name>
Optional Addons#
7. Amazon CloudWatch Observability (amazon-cloudwatch-observability)#
Role: Send container metrics and logs to CloudWatch.
Components:
- CloudWatch Agent: metrics collection
- Fluent Bit: log collection
- ADOT Collector: tracing (optional)
8. Amazon GuardDuty Agent (aws-guardduty-agent)#
Role: Runtime threat detection. Monitors containers for anomalous behavior.
9. AWS Distro for OpenTelemetry (adot)#
Role: Distributed tracing and metrics collection. Sends to X-Ray, CloudWatch, Prometheus, etc.
10. Karpenter (installed separately)#
Role: Node autoscaling. Detects Pending Pods → provisions optimal instances.
Not an EKS Addon! Install via Helm separately:
helm install karpenter oci://public.ecr.aws/karpenter/karpenter \
--namespace karpenter
Addon Management Commands#
# List available addons
aws eks describe-addon-versions --kubernetes-version 1.34 \
--query 'addons[].{name:addonName,versions:addonVersions[0].addonVersion}' \
--output table
# List installed addons
aws eks list-addons --cluster-name <cluster>
# Addon details
aws eks describe-addon --cluster-name <cluster> --addon-name vpc-cni
# Install addon
aws eks create-addon --cluster-name <cluster> --addon-name vpc-cni
# Update addon
aws eks update-addon --cluster-name <cluster> --addon-name vpc-cni \
--addon-version v1.21.1-eksbuild.5
# Delete addon
aws eks delete-addon --cluster-name <cluster> --addon-name vpc-cni
Addon Configuration in Terraform#
module "eks" {
source = "terraform-aws-modules/eks/aws"
version = "~> 20.0"
cluster_addons = {
# Required
vpc-cni = {
most_recent = true
configuration_values = jsonencode({
tolerations = [{ operator = "Exists" }]
})
}
kube-proxy = {
most_recent = true
}
coredns = {
most_recent = true
configuration_values = jsonencode({
tolerations = [{
key = "role"
value = "infra"
effect = "NoSchedule"
}]
})
}
# Recommended
aws-ebs-csi-driver = {
most_recent = true
service_account_role_arn = module.ebs_csi_irsa.iam_role_arn
}
}
}
Checking Addon Version Compatibility#
# List supported addon versions for a specific K8s version
aws eks describe-addon-versions \
--addon-name vpc-cni \
--kubernetes-version 1.34 \
--query 'addons[].addonVersions[].addonVersion'
Notes:
- Check addon compatibility before upgrading EKS version
- `most_recent = true` automatically selects the latest compatible version
- Pin versions in production environments
Addon Troubleshooting#
When an addon is in Degraded state:
# Check status
aws eks describe-addon --cluster-name <cluster> --addon-name vpc-cni \
--query 'addon.health'
# Check Pod status
kubectl get pods -n kube-system -l k8s-app=aws-node
kubectl describe pod -n kube-system <aws-node-pod>
# Check events
kubectl get events -n kube-system --sort-by='.lastTimestamp'
Common issues:
- Insufficient IRSA role permissions → check IAM policies
- Missing toleration → addon cannot run if all nodes have taints
- Insufficient resources → adjust requests/limits
- Image pull failure → check ECR permissions or network
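For the IRSA case specifically, it helps to check that the addon's service account actually carries the role annotation. A sketch for vpc-cni (the jsonpath escaping of dots in the annotation key is required):

```shell
# The aws-node service account should be annotated with the IRSA role ARN;
# an empty result means the addon is running with the node role instead.
kubectl get sa aws-node -n kube-system \
  -o jsonpath='{.metadata.annotations.eks\.amazonaws\.com/role-arn}'
```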