# EKS Node Cluster Join Failure Checklist
## Problem

- EKS node group created but nodes are not joining the cluster
- `kubectl get nodes` shows no nodes
- EC2 instances are in Running state
## Checklist
### 1. VPC DNS Settings

| Setting | Required Value | Check |
|---|---|---|
| `enable_dns_hostnames` | `true` | |
| `enable_dns_support` | `true` | |
Verification:

```hcl
resource "aws_vpc" "main" {
  enable_dns_hostnames = true
  enable_dns_support   = true
}
```
### 2. Subnet Tags

The EKS controller requires specific tags to recognize subnets.

| Subnet | Tag | Check |
|---|---|---|
| Public | `kubernetes.io/role/elb = 1` | |
| Private | `kubernetes.io/role/internal-elb = 1` | |
| Both | `kubernetes.io/cluster/<cluster-name> = shared` | |
Verification:

```hcl
# modules/vpc/main.tf - Public Subnet
tags = {
  "kubernetes.io/role/elb"                        = "1"
  "kubernetes.io/cluster/${var.eks_cluster_name}" = "shared"
}

# modules/vpc/main.tf - Private Subnet
tags = {
  "kubernetes.io/role/internal-elb"               = "1"
  "kubernetes.io/cluster/${var.eks_cluster_name}" = "shared"
}
```
### 3. Subnet Settings

| Subnet | Setting | Required Value | Check |
|---|---|---|---|
| Public | `map_public_ip_on_launch` | `true` | |
| Private | `map_public_ip_on_launch` | `false` (default) | |
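For reference, a public subnet with this flag might look like the sketch below (the resource name and CIDR are placeholders, not taken from the actual modules):

```hcl
# Sketch only - resource name and CIDR are placeholders
resource "aws_subnet" "public" {
  vpc_id                  = aws_vpc.main.id
  cidr_block              = "10.0.1.0/24"
  map_public_ip_on_launch = true # nodes need a public IP to reach the API server without NAT

  tags = {
    "kubernetes.io/role/elb"                        = "1"
    "kubernetes.io/cluster/${var.eks_cluster_name}" = "shared"
  }
}
```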
### 4. Security Group Communication

Required ports between nodes and the control plane:

| Direction | Port | Purpose | Check |
|---|---|---|---|
| Node → Control Plane | 443 | API Server | |
| Control Plane → Node | 10250 | Kubelet | |
| Node ↔ Node | All | Pod communication | |
When using terraform-aws-modules/eks:

```hcl
# Required SG rules are created automatically
attach_cluster_primary_security_group = true
```
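If you manage the security groups yourself instead of relying on the module, the two critical rules would look roughly like this (a sketch; `aws_security_group.cluster` and `aws_security_group.node` are assumed names):

```hcl
# Sketch only - aws_security_group.cluster / .node are assumed names
resource "aws_security_group_rule" "node_to_cluster_api" {
  description              = "Node to control plane (API server)"
  type                     = "ingress"
  from_port                = 443
  to_port                  = 443
  protocol                 = "tcp"
  security_group_id        = aws_security_group.cluster.id
  source_security_group_id = aws_security_group.node.id
}

resource "aws_security_group_rule" "cluster_to_node_kubelet" {
  description              = "Control plane to kubelet"
  type                     = "ingress"
  from_port                = 10250
  to_port                  = 10250
  protocol                 = "tcp"
  security_group_id        = aws_security_group.node.id
  source_security_group_id = aws_security_group.cluster.id
}
```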
### 5. Cluster Endpoint Settings

| Setting | Recommended | Description | Check |
|---|---|---|---|
| `cluster_endpoint_private_access` | `true` | Private nodes access the API via the internal network | |
| `cluster_endpoint_public_access` | `true` | External kubectl access | |

Note: If `private_access = false`, nodes must go through NAT → public internet, which complicates routing.
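With terraform-aws-modules/eks, both settings map directly to module inputs; a minimal sketch (other required module arguments omitted):

```hcl
# Sketch only - other required module arguments omitted
module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  version = "~> 20.0"

  cluster_name = var.eks_cluster_name

  cluster_endpoint_private_access = true # nodes reach the API server over the VPC network
  cluster_endpoint_public_access  = true # kubectl access from outside the VPC
}
```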
### 6. VPC Endpoints (Required for AL2023)

#### eks-auth VPC Endpoint (PrivateLink)

| Item | Description |
|---|---|
| Service name | `com.amazonaws.<region>.eks-auth` |
| Type | Interface (PrivateLink) |
| Required | Mandatory when using the AL2023 AMI |
Why:

- AL2023 uses `nodeadm` (replaces the legacy `bootstrap.sh`)
- `nodeadm` uses EKS Pod Identity for authentication
- The `eks-auth` API is PrivateLink-only (not reachable via NAT Gateway)
Symptom:

- The node bootstrap log shows `nodeadm: done!`
- But the node never registers with the cluster
Fix:

```yaml
# config.yaml
vpc_endpoints:
  - eks-auth
```
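If the endpoint is managed with plain Terraform instead of through `config.yaml`, the equivalent resource would look roughly like the sketch below (variable names are placeholders):

```hcl
# Sketch only - variable names are placeholders
resource "aws_vpc_endpoint" "eks_auth" {
  vpc_id              = aws_vpc.main.id
  service_name        = "com.amazonaws.${var.region}.eks-auth"
  vpc_endpoint_type   = "Interface"
  subnet_ids          = var.private_subnet_ids
  security_group_ids  = [var.endpoint_security_group_id]
  private_dns_enabled = true # lets nodeadm resolve the eks-auth API to the endpoint ENIs
}
```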
#### Other VPC Endpoints
| Endpoint | Type | NAT Alternative | Cost |
|---|---|---|---|
| S3 | Gateway | Possible | Free |
| ECR (ecr.api, ecr.dkr) | Interface | Possible | Paid |
| STS | Interface | Possible | Paid |
| eks-auth | Interface | Not possible | Paid |
Conclusion: When using a NAT Gateway, only `eks-auth` is mandatory; the rest are optional.
### 7. AMI Type and User Data

#### AL2 vs AL2023 Differences
| Item | AL2 | AL2023 |
|---|---|---|
| Bootstrap | bootstrap.sh | nodeadm |
| Authentication | IAM Role | EKS Pod Identity |
| eks-auth required | No | Yes (mandatory) |
| AMI type | AL2_ARM_64 | AL2023_ARM_64_STANDARD |
#### User Data Conflict Check

- Custom `user_data` can conflict with the AL2023 bootstrap
- When using terraform-aws-modules/eks, user data is handled automatically, so it is best not to override it
Verification:

```hcl
# modules/eks/main.tf - Node Group
infra = {
  ami_type = "AL2023_ARM_64_STANDARD"
  # no custom user_data or launch_template
}
```
### 8. IAM Policies

Required policies for the node IAM Role:

| Policy | Purpose | Check |
|---|---|---|
| `AmazonEKSWorkerNodePolicy` | Basic node policy | (auto via module) |
| `AmazonEKS_CNI_Policy` | VPC CNI | (auto via module) |
| `AmazonEC2ContainerRegistryReadOnly` | ECR pull | (auto via module) |
| `AmazonEKSWorkerNodeMinimalPolicy` | AL2023 minimal permissions | |
| `AmazonEBSCSIDriverPolicy` | EBS CSI | |
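With terraform-aws-modules/eks, the non-default policies can be attached through `iam_role_additional_policies` on the node group; a sketch reusing the `infra` node group from the earlier verification (the ARNs shown are the standard AWS-managed policy ARNs):

```hcl
# Sketch only - reuses the "infra" node group from the verification above
eks_managed_node_groups = {
  infra = {
    ami_type = "AL2023_ARM_64_STANDARD"

    iam_role_additional_policies = {
      AmazonEKSWorkerNodeMinimalPolicy = "arn:aws:iam::aws:policy/AmazonEKSWorkerNodeMinimalPolicy"
      AmazonEBSCSIDriverPolicy         = "arn:aws:iam::aws:policy/service-role/AmazonEBSCSIDriverPolicy"
    }
  }
}
```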
### 9. EKS Access Entries

Authentication method for EKS 1.30+

| Type | Purpose |
|---|---|
| `EC2_LINUX` | Node group (auto-created) |
| `STANDARD` | User/role access |
SSO access configuration:

```yaml
# config.yaml
eks:
  access_entries:
    devops_sso:
      enabled: true
      role_arn: "arn:aws:iam::ACCOUNT_ID:role/aws-reserved/sso.amazonaws.com/REGION/AWSReservedSSO_..."
```
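The same entry can also be expressed directly as a terraform-aws-modules/eks `access_entries` input; a sketch, where the cluster-admin policy association is an assumption rather than part of the original config:

```hcl
# Sketch only - the cluster-admin association is an assumption
access_entries = {
  devops_sso = {
    principal_arn = var.devops_sso_role_arn
    type          = "STANDARD"

    policy_associations = {
      admin = {
        policy_arn = "arn:aws:eks::aws:cluster-access-policy/AmazonEKSClusterAdminPolicy"
        access_scope = {
          type = "cluster"
        }
      }
    }
  }
}
```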
## Debugging Commands

### Check EC2 instance logs

```bash
# Connect to node via SSM
aws ssm start-session --target <instance-id>

# Bootstrap logs
sudo journalctl -u nodeadm -f
sudo cat /var/log/cloud-init-output.log

# kubelet status
sudo systemctl status kubelet
sudo journalctl -u kubelet -f
```
### Check cluster status

```bash
# Node list
kubectl get nodes

# Cluster info
kubectl cluster-info

# Check Access Entries
aws eks list-access-entries --cluster-name <cluster-name>
```
## Final Checklist

- VPC DNS settings (`enable_dns_hostnames`, `enable_dns_support`)
- Subnet tags (`kubernetes.io/role/elb`, `kubernetes.io/cluster/...`)
- Public subnet `map_public_ip_on_launch = true`
- Security group allows ports 443 and 10250
- `cluster_endpoint_private_access = true`
- `eks-auth` VPC Endpoint created (required for AL2023)
- AMI type confirmed (`AL2023_ARM_64_STANDARD`)
- No custom User Data
- IAM policies include `AmazonEKSWorkerNodeMinimalPolicy`
## Notes

- A missing `eks-auth` VPC Endpoint is the primary cause of join failures for EKS 1.35 + AL2023 combinations
- A NAT Gateway alone cannot reach the `eks-auth` API (PrivateLink only)
- terraform-aws-modules/eks v20.x handles most settings automatically