K8s Taint vs NodeSelector Troubleshooting
Problem#
ArgoCD alerts were triggered for Pods in the dev-webs namespace showing Degraded status:
- dev-order-core
- dev-seat
- dev-queue
Root Cause#
The app-dedicated worker node (mini-might) ran out of resources.
Why: the control-plane node (mini-gmk) carries the node-role.kubernetes.io/control-plane:NoSchedule taint, which prevented infra Pods (istio, loki, calico, coredns, metrics-server, etc.) from being scheduled there, so they were all pushed onto the app-only worker node instead.
Infra Pods scheduled on mini-might (worker):
- istio-ingressgateway (x2)
- istiod
- kiali
- loki-0, loki-chunks-cache-0, loki-results-cache-0
- calico-apiserver (x2), calico-kube-controllers, calico-typha
- coredns (x2)
- metrics-server
- alloy
→ Resource contention when deploying app Pods; new Pods stay stuck in Pending
Detailed Analysis#
1. Pending Pods#
kubectl get pods -n dev-webs
New version Pods are Pending, previous version remains Running
2. Scheduling failure reason#
Warning FailedScheduling default-scheduler
0/2 nodes are available:
1 node(s) didn't match Pod's node affinity/selector,
1 node(s) had untolerated taint {node-role.kubernetes.io/control-plane: }
- mini-gmk (control-plane): has the taint, so scheduling is blocked
- mini-might (worker): missing the role=app label, so the nodeSelector doesn't match
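For reference, the relevant part of the rendered Pod spec looked roughly like this (a sketch, not copied from the cluster — the role=app selector is inferred from the fix applied below):

```yaml
# The scheduler only considers nodes carrying the role=app label,
# and mini-might did not have that label yet, so no node qualified.
spec:
  nodeSelector:
    role: app
```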
3. Over-allocated resources#
Allocated resources:
memory: 16076Mi (57%) requests, 28295Mi (101%) limits
- replicas set to 2 causes resource shortage
- Made worse by old and new Pod versions coexisting during RollingUpdate
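Besides lowering replicas, keeping explicit, modest resource requests in the dev values files keeps the scheduler's accounting below node capacity. A sketch (key names assumed to follow the common Helm chart convention; exact paths depend on the chart):

```yaml
# Hypothetical addition to a dev values file (e.g. values-seat.yaml).
# Small requests leave headroom even while old and new Pods coexist
# during a RollingUpdate surge.
resources:
  requests:
    memory: 256Mi
    cpu: 100m
  limits:
    memory: 512Mi
```

Alternatively, a RollingUpdate strategy with maxSurge: 0 (and maxUnavailable: 1) avoids old and new Pods coexisting at all, at the cost of brief reduced capacity during rollout.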
Fix#
1. Reduce replicas (dev environment)#
# values-auth-guard.yaml, values-queue.yaml, values-seat.yaml
replicaCount: 2 → 1
2. Add node label#
kubectl label node mini-might role=app
3. Remove Taint (recommended)#
kubectl taint nodes mini-gmk node-role.kubernetes.io/control-plane:NoSchedule-
4. Separate nodes using NodeSelector#
| Category | nodeSelector |
|---|---|
| Apps (dev-webs, dev-ai) | kubernetes.io/hostname: mini-might |
| Infra (istio, monitoring, loki) | kubernetes.io/hostname: mini-gmk |
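In values terms, pinning a chart to a node looks roughly like this (the exact key path varies per chart — for example, the Loki chart nests nodeSelector under each component, such as singleBinary.nodeSelector):

```yaml
# Sketch for an infra chart's values file: pin the workload to the
# control-plane node by hostname. Requires the taint to be removed
# (or a matching toleration) for scheduling to succeed.
nodeSelector:
  kubernetes.io/hostname: mini-gmk
```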
Taint vs NodeSelector Comparison#
| Method | Pros | Cons |
|---|---|---|
| Taint/Toleration | Enforced; prevents accidental scheduling | More complex; every Pod targeting the tainted node needs a toleration |
| NodeSelector | Simple, explicit | Pods can land anywhere if nodeSelector is missing |
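To illustrate the "toleration on every Pod" cost: if the taint were kept, each infra workload would need something like the following in its Pod spec:

```yaml
# Standard toleration for the kubeadm control-plane taint. Note that a
# toleration only *permits* scheduling on the tainted node, it does not
# force it — a nodeSelector would still be needed to actually pin the Pod.
tolerations:
  - key: node-role.kubernetes.io/control-plane
    operator: Exists
    effect: NoSchedule
```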
Conclusion#
- Remove the Taint: reduces management complexity
- Use NodeSelector only: simpler and more intuitive
- Explicitly separate apps to the worker node and infra to the control-plane node
Modified Files (303-goormgb-k8s-helm)#
dev/values/apps/#
- values-auth-guard.yaml: replicaCount 2 → 1
- values-queue.yaml: replicaCount 2 → 1
- values-seat.yaml: replicaCount 2 → 1
dev/values/monitoring/#
values-loki.yaml: added nodeSelector
Useful Commands#
# Check taints
kubectl describe node mini-gmk | grep Taint
# Check nodeSelector mismatch
kubectl describe pod <pending-pod> | grep -A 5 Events
# Check node labels
kubectl get nodes --show-labels
# Reschedule infra Pods
kubectl delete pod -n istio-system -l app=istiod