Workload Placement
Workload placement happens in two stages: (1) filtering, which excludes any unsuitable nodes, and then (2) scoring, which ranks the remaining nodes to find the best fit.
Adding Labels To Nodes
This article assumes you are familiar with adding labels to nodes. See this article for more.
Taint
Taints are a special kind of label with a key-value pair, but they tell the scheduler that a particular node is different. For example, the master (control plane) taint is applied to control plane nodes by default, so your applications will not get scheduled on those important nodes.
You can use taints to record relevant attributes about nodes, like the type of hardware. When you add a taint, workloads will not be scheduled on that node unless you add a matching toleration to the workload.
For example, nothing will be scheduled if you add this taint to all nodes! Note that a NoSchedule taint doesn't impact existing workloads, only future ones.
kubectl taint nodes --all kiamol-disk=hdd:NoSchedule
This is how you would add the toleration to a workload:
spec:                      # The Pod spec in a Deployment
  containers:
    - name: sleep
      image: kiamol/ch03-sleep
  tolerations:             # Lists taints this Pod is happy with
    - key: "kiamol-disk"   # The key, value, and effect all need
      operator: "Equal"    # to match the taint on the node.
      value: "hdd"
      effect: "NoSchedule"
The effect can be one of three types:
1. NoSchedule - the Pod will not be scheduled on the node.
2. PreferNoSchedule - the scheduler will try to avoid scheduling the Pod on the node.
3. NoExecute - the Pod will not be scheduled on the node, and any existing Pods on the node will be evicted. This effect is retroactive, meaning it affects existing Pods as well as new ones (unlike the other two).
See this article for more info.
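As a sketch (reusing the kiamol-disk key from the example above; the 60-second grace period is arbitrary), a toleration for a NoExecute taint can include tolerationSeconds, which lets a Pod stay on a newly tainted node for a limited time before it is evicted:
tolerations:
  - key: "kiamol-disk"     # matches the taint key from the earlier example
    operator: "Equal"
    value: "hdd"
    effect: "NoExecute"    # tolerate the eviction-level taint...
    tolerationSeconds: 60  # ...but only for 60 seconds after the taint appears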
You can add a taint to a node like this:
% kubectl taint nodes node1 key1=value1:NoSchedule
# remove the taint like this
% kubectl taint nodes node1 key1=value1:NoSchedule-
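For reference, the default control plane taint mentioned earlier is visible if you inspect the nodes (the grep just narrows the output; on a kubeadm cluster it typically shows node-role.kubernetes.io/control-plane:NoSchedule, or node-role.kubernetes.io/master:NoSchedule on older versions):
% kubectl describe nodes | grep Taints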
Taints are only for negative associations - you can't use them to say, "this node is good for this workload". For that, you need nodeSelector or nodeAffinity. You would not use a taint to say a workload should run on a GPU node, for example.
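That said, a common pattern combines the two (just a sketch here; the node name gpu-node-1 and the hardware=gpu key are illustrative): taint the GPU nodes so ordinary workloads stay off them, and label them so GPU workloads can select them.
# keep general workloads off the GPU node
% kubectl taint nodes gpu-node-1 hardware=gpu:NoSchedule
# let GPU workloads target it with a nodeSelector or nodeAffinity rule
% kubectl label nodes gpu-node-1 hardware=gpu
The GPU workload then needs both a toleration for the taint and a selector for the label.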
Node Selector
This is an example of using nodeSelector:
spec:
  containers:
    - name: sleep
      image: kiamol/ch03-sleep
  tolerations:                      # The Pod tolerates nodes
    - key: "kiamol-disk"            # with the hdd taint.
      operator: "Equal"
      value: "hdd"
      effect: "NoSchedule"
  nodeSelector:                     # The Pod will run only on nodes
    kubernetes.io/arch: zxSpectrum  # that match this CPU type.
The kubernetes.io/arch label in this example is set automatically by Kubernetes on each node. For example, on my laptop, if I run kubectl get nodes -o yaml, it shows the key-value pair architecture: arm64 under nodeInfo. Since no node reports a zxSpectrum architecture, this Pod would never be scheduled.
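Two quick ways to check the architecture on your own nodes (standard kubectl, nothing assumed beyond a running cluster):
# read the nodeInfo field directly
% kubectl get nodes -o jsonpath='{.items[*].status.nodeInfo.architecture}'
# or show the matching node label as an extra column
% kubectl get nodes -L kubernetes.io/arch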
Node selectors ensure that apps run only on nodes with specific label values, but you usually want some more flexibility than a straight equality match. A finer level of control comes with affinity and antiaffinity.
Here is another example of using nodeSelector:
First, label your nodes:
# see the list of nodes with their names and labels
% kubectl get nodes --show-labels
# apply a label to a node
% kubectl label nodes <your-node-name> disktype=ssd
Then, add the nodeSelector to your config:
apiVersion: v1
kind: Pod
metadata:
  name: nginx
  labels:
    env: test
spec:
  containers:
    - name: nginx
      image: nginx
      imagePullPolicy: IfNotPresent
  nodeSelector:
    disktype: ssd
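To check where the Pod landed (standard kubectl; if no node has the disktype=ssd label, the Pod stays Pending and the describe output shows a FailedScheduling event):
# show which node the Pod was scheduled on
% kubectl get pod nginx -o wide
# if it is stuck in Pending, look at the scheduler events
% kubectl describe pod nginx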
Affinity
Unlike a taint, this is a positive association between a Pod and a node. Affinity uses a node selector, but with a match expression rather than a straight equality check. There are two kinds:
- requiredDuringSchedulingIgnoredDuringExecution: the scheduler can't schedule the Pod unless the rule is met. This functions like nodeSelector, but with a more expressive syntax.
- preferredDuringSchedulingIgnoredDuringExecution: the scheduler will try to meet the rule. If a matching node is not available, the Pod will still be scheduled.
You can also constrain a Pod using labels on other Pods running on the node (or other topological domain), instead of just node labels, which allows you to define rules for which Pods can be co-located on a node (more on this below). Here is a node affinity example from the docs:
apiVersion: v1
kind: Pod
metadata:
  name: with-node-affinity
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: topology.kubernetes.io/zone
                operator: In
                values:
                  - antarctica-east1
                  - antarctica-west1
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 1
          preference:
            matchExpressions:
              - key: another-node-label-key
                operator: In
                values:
                  - another-node-label-value
  containers:
    - name: with-node-affinity
      image: registry.k8s.io/pause:2.0
In this example, the following rules apply:
- The node must have a label with the key topology.kubernetes.io/zone, and the value of that label must be either antarctica-east1 or antarctica-west1.
- The node preferably has a label with the key another-node-label-key and the value another-node-label-value.
The operator used above is In, but it can also be NotIn, Exists, DoesNotExist, Gt, or Lt.
NotIn and DoesNotExist allow you to define anti-affinity rules at the node level. For example, you could say "don't schedule this Pod on any node that has this label". You could also use taints to repel Pods in a similar way.
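As a small sketch (the label keys here are made up), Exists matches any node that has the key at all, and NotIn keeps the Pod off nodes whose label has one of the listed values:
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: example.com/dedicated  # node must have this label (any value)
              operator: Exists
            - key: example.com/disk       # ...and its disk label must not be hdd
              operator: NotIn
              values:
                - hdd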
Affinity Weight
For preferredDuringSchedulingIgnoredDuringExecution scheduling, you can set a weight between 1 and 100. For each preferred rule a node satisfies, the scheduler adds that rule's weight to the node's score when making a scheduling decision. In the example below, a node matching both preferences gains 51 points toward its score, while a node matching only the second gains 50.
Example of two different weights:
apiVersion: v1
kind: Pod
metadata:
  name: with-affinity-anti-affinity
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: kubernetes.io/os
                operator: In
                values:
                  - linux  # The node MUST have the label kubernetes.io/os=linux
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 1
          preference:
            matchExpressions:
              - key: label-1
                operator: In
                values:
                  - key-1
        - weight: 50
          preference:
            matchExpressions:
              - key: label-2
                operator: In
                values:
                  - key-2
  containers:
    - name: with-node-affinity
      image: registry.k8s.io/pause:2.0
Inter-pod affinity and anti-affinity
With inter-pod affinity and anti-affinity you can either have Pods run together on the same node or make sure they run on separate nodes. See these docs if you need this. Maybe you want GPU workloads to run separately, for example.
The affinity or anti-affinity can be scoped to a node, a zone, a region, etc. To set the scope, you set the topologyKey to the appropriate label. For example, if you want Pods to run in the same zone, you would set topologyKey to topology.kubernetes.io/zone.
This example prevents multiple replicas with the label app=store from running on the same node:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: redis-cache
spec:
  selector:
    matchLabels:
      app: store
  replicas: 3
  template:
    metadata:
      labels:
        app: store
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchExpressions:
                  - key: app
                    operator: In
                    values:
                      - store
              topologyKey: "kubernetes.io/hostname"
      containers:
        - name: redis-server
          image: redis:3.2-alpine
From the docs
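To confirm the spread, listing the Pods with -o wide shows the node each replica landed on; with this anti-affinity rule each store replica should be on a different node (and with fewer than three nodes, the extra replicas stay Pending):
% kubectl get pods -l app=store -o wide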
Topology
Topology refers to the physical layout of your cluster. The hostname label is always present and is unique per node. Cloud providers add region and zone labels. A topology key sets the level where the affinity applies. For example, hostname would force all Pods onto the same node, zone would force all Pods into the same zone, etc. Anti-affinity works the same way, except it keeps Pods from being scheduled on the same node, in the same zone, etc. For example, this Pod affinity rule keeps Pods with matching app and component labels together on one node:
affinity:                             # Affinity rules for Pods use
  podAffinity:                        # the same spec as node affinity.
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:           # This looks for the app and
            - key: app                # component labels to match.
              operator: In
              values:
                - numbers
            - key: component
              operator: In
              values:
                - api
        topologyKey: "kubernetes.io/hostname"
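To see which topology labels your nodes actually carry (zone and region typically only appear on cloud-provider nodes), you can list them as extra columns:
% kubectl get nodes -L kubernetes.io/hostname -L topology.kubernetes.io/zone -L topology.kubernetes.io/region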
In this next example, the podAffinity rule requires the Pod to be scheduled in a zone that already has one or more Pods with the label security=S1, and the podAntiAffinity rule says "prefer not to schedule this Pod in a zone that already has one or more Pods with the label security=S2":
apiVersion: v1
kind: Pod
metadata:
  name: with-pod-affinity
spec:
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchExpressions:
              - key: security
                operator: In
                values:
                  - S1
          topologyKey: topology.kubernetes.io/zone
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          podAffinityTerm:
            labelSelector:
              matchExpressions:
                - key: security
                  operator: In
                  values:
                    - S2
            topologyKey: topology.kubernetes.io/zone
  containers:
    - name: with-pod-affinity
      image: registry.k8s.io/pause:2.0
From the docs
Read this article for more info.