Problem#

ArgoCD alerts were triggered for Pods in the dev-webs namespace showing Degraded status:

  • dev-order-core
  • dev-seat
  • dev-queue

Root Cause#

The app-dedicated node (mini-might) ran out of resources.

The control-plane node (mini-gmk) carries a node-role.kubernetes.io/control-plane:NoSchedule taint, which prevented infra Pods (istio, loki, calico, coredns, metrics-server, etc.) from being scheduled there, so they were all pushed onto the app-only worker node instead.

Infra Pods scheduled on mini-might (worker):
- istio-ingressgateway (x2)
- istiod
- kiali
- loki-0, loki-chunks-cache-0, loki-results-cache-0
- calico-apiserver (x2), calico-kube-controllers, calico-typha
- coredns (x2)
- metrics-server
- alloy

→ Resource contention when deploying app Pods: new Pods go Pending
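
To confirm where the infra Pods actually landed, listing Pods per node is one quick check (the node name below matches this cluster; the flags are standard kubectl):

# List every Pod scheduled on the worker node mini-might, across all namespaces
kubectl get pods -A -o wide --field-selector spec.nodeName=mini-might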

Detailed Analysis#

1. Pending Pods#

kubectl get pods -n dev-webs

The new-version Pods are Pending while the previous-version Pods remain Running
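
Filtering for the stuck Pods directly also works, using a standard field selector:

# Show only Pods stuck in Pending in the dev-webs namespace
kubectl get pods -n dev-webs --field-selector=status.phase=Pending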

2. Scheduling failure reason#

Warning  FailedScheduling  default-scheduler
0/2 nodes are available:
  1 node(s) didn't match Pod's node affinity/selector,
  1 node(s) had untolerated taint {node-role.kubernetes.io/control-plane: }
  • mini-gmk (control-plane): has taint, scheduling blocked
  • mini-might (worker): missing role=app label, nodeSelector mismatch
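
For illustration, assuming the app charts render a nodeSelector of role: app (inferred from the missing-label message above), the relevant Pod spec fragment would look like this; with no node carrying that label, neither node can match:

# Pod spec fragment (hypothetical, based on the role=app selector implied by the events)
spec:
  nodeSelector:
    role: app   # no node has this label yet, so scheduling fails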

3. Over-allocated resources#

Allocated resources:
  memory: 16076Mi (57%) requests, 28295Mi (101%) limits
  • replicas set to 2 causes a resource shortage
  • Made worse by old and new Pod versions coexisting during the RollingUpdate
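
The coexistence is normal RollingUpdate behavior: the Deployment may surge above the desired replica count while old Pods are still running. A minimal sketch of the Kubernetes defaults (not values taken from the repo):

# Deployment strategy defaults: during a rollout up to 25% extra Pods can run
# alongside the old ReplicaSet until the new Pods become Ready.
strategy:
  type: RollingUpdate
  rollingUpdate:
    maxSurge: 25%          # extra Pods allowed above replicas during the rollout
    maxUnavailable: 25%    # old Pods that may be removed before new ones are Ready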

Fix#

1. Reduce replicas (dev environment)#

# values-auth-guard.yaml, values-queue.yaml, values-seat.yaml
replicaCount: 1   # was 2

2. Add node label#

kubectl label node mini-might role=app

3. Remove the control-plane taint#

kubectl taint nodes mini-gmk node-role.kubernetes.io/control-plane:NoSchedule-

4. Separate nodes using NodeSelector#

Category                           nodeSelector
Apps (dev-webs, dev-ai)            kubernetes.io/hostname: mini-might
Infra (istio, monitoring, loki)    kubernetes.io/hostname: mini-gmk
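
Applied at the Pod template level, the separation in the table is a two-line nodeSelector; a sketch (the exact key path inside each Helm values file depends on the chart):

# App workloads (dev-webs, dev-ai) → pinned to the worker node
nodeSelector:
  kubernetes.io/hostname: mini-might

# Infra workloads (istio, monitoring, loki) → pinned to the control-plane node
nodeSelector:
  kubernetes.io/hostname: mini-gmk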

Taint vs NodeSelector Comparison#

Method              Pros                          Cons
Taint/Toleration    Enforced, prevents mistakes   Complex, requires a toleration on every Pod
NodeSelector        Simple, explicit              Pods can land anywhere if nodeSelector is missing
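
To make the "toleration on every Pod" cost concrete: keeping the control-plane taint would require every infra chart to add something like the following to its Pod spec (standard toleration syntax, shown only to illustrate the comparison):

tolerations:
- key: node-role.kubernetes.io/control-plane
  operator: Exists
  effect: NoSchedule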

Conclusion#

  • Remove the Taint: reduces management complexity
  • Use NodeSelector only: simpler and more intuitive
  • Explicitly pin apps to the worker node and infra to the control-plane node via nodeSelector

Modified Files (303-goormgb-k8s-helm)#

dev/values/apps/#

  • values-auth-guard.yaml: replicaCount 2→1
  • values-queue.yaml: replicaCount 2→1
  • values-seat.yaml: replicaCount 2→1

dev/values/monitoring/#

  • values-loki.yaml: added nodeSelector
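
A rough sketch of that addition; the top-level key depends on the Loki chart in use, so treat the loki: nesting here as an assumption rather than the exact diff:

# values-loki.yaml (hypothetical layout; exact key path depends on the chart)
loki:
  nodeSelector:
    kubernetes.io/hostname: mini-gmk   # keep Loki on the control-plane node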

Useful Commands#

# Check taints
kubectl describe node mini-gmk | grep Taint

# Check nodeSelector mismatch
kubectl describe pod <pending-pod> | grep -A 5 Events

# Check node labels
kubectl get nodes --show-labels

# Reschedule infra Pods
kubectl delete pod -n istio-system -l app=istiod