Problem#

ArgoCD alerts were triggered for Pods in the dev-webs namespace showing Degraded status:

  • dev-order-core
  • dev-seat
  • dev-queue

Root Cause#

The app-dedicated node (mini-might) ran out of resources.

The control-plane node (mini-gmk) carries a node-role.kubernetes.io/control-plane:NoSchedule taint, which prevented infra Pods (istio, loki, calico, coredns, metrics-server, etc.) from being scheduled there, so they were all pushed onto the app-only worker node instead.

Infra Pods scheduled on mini-might (worker):
- istio-ingressgateway (x2)
- istiod
- kiali
- loki-0, loki-chunks-cache-0, loki-results-cache-0
- calico-apiserver (x2), calico-kube-controllers, calico-typha
- coredns (x2)
- metrics-server
- alloy

→ Resource contention when deploying app Pods: new Pods go Pending
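
To confirm where the infra Pods actually landed, listing Pods per node is one quick check (the node name below matches this cluster; the flags are standard kubectl):

# List every Pod scheduled on the worker node mini-might, across all namespaces
kubectl get pods -A -o wide --field-selector spec.nodeName=mini-might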

Detailed Analysis#

1. Pending Pods#

kubectl get pods -n dev-webs

The new-version Pods are Pending while the previous-version Pods remain Running
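
Filtering for the stuck Pods directly also works, using a standard field selector:

# Show only Pods stuck in Pending in the dev-webs namespace
kubectl get pods -n dev-webs --field-selector=status.phase=Pending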

2. Scheduling failure reason#

Warning  FailedScheduling  default-scheduler
0/2 nodes are available:
  1 node(s) didn't match Pod's node affinity/selector,
  1 node(s) had untolerated taint {node-role.kubernetes.io/control-plane: }
  • mini-gmk (control-plane): has taint, scheduling blocked
  • mini-might (worker): missing role=app label, nodeSelector mismatch
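
For illustration, assuming the app charts render a nodeSelector of role: app (inferred from the missing-label message above), the relevant Pod spec fragment would look like this; with no node carrying that label, neither node can match:

# Pod spec fragment (hypothetical, based on the role=app selector implied by the events)
spec:
  nodeSelector:
    role: app   # no node has this label yet, so scheduling fails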

3. Over-allocated resources#

Allocated resources:
  memory: 16076Mi (57%) requests, 28295Mi (101%) limits
  • replicas set to 2 causes a resource shortage
  • Made worse by old and new Pod versions coexisting during the RollingUpdate
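
The coexistence is normal RollingUpdate behavior: the Deployment may surge above the desired replica count while old Pods are still running. A minimal sketch of the Kubernetes defaults (not values taken from the repo):

# Deployment strategy defaults: during a rollout up to 25% extra Pods can run
# alongside the old ReplicaSet until the new Pods become Ready.
strategy:
  type: RollingUpdate
  rollingUpdate:
    maxSurge: 25%          # extra Pods allowed above replicas during the rollout
    maxUnavailable: 25%    # old Pods that may be removed before new ones are Ready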

Fix#

1. Reduce replicas (dev environment)#

# values-auth-guard.yaml, values-queue.yaml, values-seat.yaml
replicaCount: 1   # was 2

2. Add node label#

kubectl label node mini-might role=app

3. Remove the control-plane taint#

kubectl taint nodes mini-gmk node-role.kubernetes.io/control-plane:NoSchedule-

4. Separate nodes using NodeSelector#

Category                           nodeSelector
Apps (dev-webs, dev-ai)            kubernetes.io/hostname: mini-might
Infra (istio, monitoring, loki)    kubernetes.io/hostname: mini-gmk
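
Applied at the Pod template level, the separation in the table is a two-line nodeSelector; a sketch (the exact key path inside each Helm values file depends on the chart):

# App workloads (dev-webs, dev-ai) → pinned to the worker node
nodeSelector:
  kubernetes.io/hostname: mini-might

# Infra workloads (istio, monitoring, loki) → pinned to the control-plane node
nodeSelector:
  kubernetes.io/hostname: mini-gmk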

Taint vs NodeSelector Comparison#

Method              Pros                          Cons
Taint/Toleration    Enforced, prevents mistakes   Complex, requires a toleration on every Pod
NodeSelector        Simple, explicit              Pods can land anywhere if nodeSelector is missing
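
To make the "toleration on every Pod" cost concrete: keeping the control-plane taint would require every infra chart to add something like the following to its Pod spec (standard toleration syntax, shown only to illustrate the comparison):

tolerations:
- key: node-role.kubernetes.io/control-plane
  operator: Exists
  effect: NoSchedule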

Conclusion#

  • Remove the Taint: reduces management complexity
  • Use NodeSelector only: simpler and more intuitive
  • Explicitly pin apps to the worker node and infra to the control-plane node via nodeSelector

Modified Files (303-goormgb-k8s-helm)#

dev/values/apps/#

  • values-auth-guard.yaml: replicaCount 2→1
  • values-queue.yaml: replicaCount 2→1
  • values-seat.yaml: replicaCount 2→1

dev/values/monitoring/#

  • values-loki.yaml: added nodeSelector
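
A rough sketch of that addition; the top-level key depends on the Loki chart in use, so treat the loki: nesting here as an assumption rather than the exact diff:

# values-loki.yaml (hypothetical layout; exact key path depends on the chart)
loki:
  nodeSelector:
    kubernetes.io/hostname: mini-gmk   # keep Loki on the control-plane node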

Useful Commands#

# Check taints
kubectl describe node mini-gmk | grep Taint

# Check nodeSelector mismatch
kubectl describe pod <pending-pod> | grep -A 5 Events

# Check node labels
kubectl get nodes --show-labels

# Reschedule infra Pods
kubectl delete pod -n istio-system -l app=istiod