
How pod placement works in Kubernetes

Understanding pod placement concepts in Kubernetes can be intimidating. In this post, we have tried to simplify it for beginner and intermediate-level platform engineers.

Saravanan Arumugam (Aswath)
April 27, 2023
10 minute read

If you are reading this post, you likely already know what Kubernetes is and even how to use it for container orchestration.

Kubernetes uses a sophisticated scheduler called the "kube-scheduler" to determine where to place newly created pods within a cluster. The scheduler's primary goal is to ensure that workloads run efficiently and resources are utilised optimally.
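
You don't normally interact with kube-scheduler directly, but its decisions are recorded as events on each Pod. A quick way to see which node a Pod landed on, or why it is still Pending (the pod name below is a placeholder):

# Shows the "Scheduled" event with the chosen node, or a FailedScheduling reason
kubectl describe pod <pod-name>

# The same information via the events API
kubectl get events --field-selector involvedObject.name=<pod-name>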

A lot has already been written about the scheduler itself. In this article, we will delve into the details of how Kubernetes decides where to place Pods and the factors that influence that decision. We will also cover best practices for optimising Pod placement in Kubernetes to achieve better performance and resource utilisation.

Example Scenario

Throughout this document, we will be using a hypothetical scenario to illustrate the different aspects of the Pod Placement Algorithm in Kubernetes.

In this scenario, we have a Kubernetes cluster with three nodes, each with different capacity and constraints:

  1. Node A: has 2 CPU cores and 4GB of RAM, but cannot run containers with privileged mode.
  2. Node B: has 4 CPU cores and 8GB of RAM, but cannot run containers with GPUs.
  3. Node C: has 8 CPU cores and 16GB of RAM, and can run any type of container.

We will be deploying a web application that consists of three pods: a frontend, a backend, and a database. Each pod has specific resource requests and limits, as well as certain constraints that need to be taken into account when scheduling them on the nodes.
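
Several of the manifests later in this post match on custom node labels such as cpu, memory, and disk. These are not labels Kubernetes applies automatically; the sketch below shows how an administrator might add them to the scenario's nodes (the node names and label keys are our own convention for this scenario):

# Custom labels encoding each node's capacity
kubectl label nodes node-a cpu=2 memory=4Gi
kubectl label nodes node-b cpu=4 memory=8Gi
kubectl label nodes node-c cpu=8 memory=16Gi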

Using this example, we will explore the different aspects of the Pod Placement Algorithm in Kubernetes, including the different placement strategies, node affinity and anti-affinity, taints and tolerations, and Pod Overhead.

The Pod Placement Algorithm

Kubernetes provides several options for placing Pods on Nodes based on the resources a Pod requests and the resources available on each Node. Placement happens in two stages: a node selection step (filtering) determines which Nodes are eligible to run a given Pod, and a Pod scheduling step (scoring) then chooses a specific Node from the eligible ones.

Kubernetes provides several strategies for node selection. These include:

  • Label-based selection
  • Node affinity and anti-affinity
  • Taints and tolerations

Label-based selection

Label-based selection allows administrators to specify a set of labels that must be present on a Node for it to be eligible to run a given Pod. A Pod manifest does this with a nodeSelector field listing the labels a Node must carry.
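
For a nodeSelector to ever match, the target Node must actually carry the label. It can be applied with kubectl (the node name is a placeholder):

kubectl label nodes <node-name> disk=ssd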

In the example below, we use label-based selection so that only Nodes with the label disk=ssd are eligible to run our nginx Pod:

apiVersion: v1
kind: Pod
metadata:
  name: nginx-pod
spec:
  containers:
  - name: nginx
    image: nginx
  nodeSelector:
    disk: ssd

Node affinity and anti-affinity

Node affinity and anti-affinity provide a more flexible way to express scheduling requirements than label-based selection. With node affinity, administrators can write richer rules for node selection based on node labels, using operators such as In, NotIn, and Exists, and can mark each rule as required or merely preferred.

Node affinity allows administrators to specify that Pods should be scheduled on Nodes that have certain labels or other attributes. Node anti-affinity allows administrators to specify that Pods should not be scheduled on Nodes that have certain labels or other attributes.
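
Node anti-affinity is expressed with the NotIn or DoesNotExist operators inside a nodeAffinity rule rather than as a separate field. A minimal sketch that keeps a Pod off nodes labelled disk=hdd (the label key and value are illustrative):

apiVersion: v1
kind: Pod
metadata:
  name: no-hdd-pod
spec:
  containers:
  - name: app
    image: nginx
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: disk
            operator: NotIn
            values:
            - hdd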

	
Returning to our example scenario, the manifests below use node affinity and tolerations to steer each tier of the web application onto a suitable node:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: frontend
spec:
  selector:
    matchLabels:
      app: web
      tier: frontend
  replicas: 1
  template:
    metadata:
      labels:
        app: web
        tier: frontend
    spec:
      containers:
      - name: frontend
        image: frontend-image:latest
        resources:
          requests:
            cpu: "500m"
            memory: "1Gi"
          limits:
            cpu: "1"
            memory: "2Gi"
      nodeSelector:
        disk: ssd
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: cpu
                operator: In
                values:
                - "8"
              - key: memory
                operator: In
                values:
                - "16Gi"
      tolerations:
      - key: "gpu"
        operator: "Exists"
        effect: "NoSchedule"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: backend
spec:
  selector:
    matchLabels:
      app: web
      tier: backend
  replicas: 1
  template:
    metadata:
      labels:
        app: web
        tier: backend
    spec:
      containers:
      - name: backend
        image: backend-image:latest
        resources:
          requests:
            cpu: "500m"
            memory: "1Gi"
          limits:
            cpu: "1"
            memory: "2Gi"
      nodeSelector:
        disk: ssd
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: cpu
                operator: In
                values:
                - "4"
              - key: memory
                operator: In
                values:
                - "8Gi"
      tolerations:
      - key: "privileged"
        operator: "Exists"
        effect: "NoSchedule"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: database
spec:
  selector:
    matchLabels:
      app: web
      tier: database
  replicas: 1
  template:
    metadata:
      labels:
        app: web
        tier: database
    spec:
      containers:
      - name: database
        image: database-image:latest
        resources:
          requests:
            cpu: "500m"
            memory: "1Gi"
          limits:
            cpu: "1"
            memory: "2Gi"
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            preference:
              matchExpressions:
              - key: disk
                operator: In
                values:
                - ssd
      tolerations:
      - key: "node"
        operator: "Exists"
        effect: "NoSchedule"

This example defines three Deployments for the web application: frontend, backend, and database.

Each pod template has resource requests and limits, plus node affinity rules that determine which nodes it can be scheduled on.

The frontend and backend Pods use required node affinity to match nodes carrying specific cpu and memory labels (custom labels we applied to the nodes in our scenario, not values Kubernetes sets automatically), along with tolerations that allow them onto nodes tainted for GPU or privileged workloads.

The database Pod has a preferred node affinity rule for nodes with SSD disks and a toleration that lets it run on nodes carrying the node taint.

Together, these rules optimise resource utilisation and keep each tier of the web application on a node that suits it.

Taints and tolerations

Taints and tolerations provide a way to repel Pods from Nodes that are not suitable for them. A taint is a key/value/effect property applied to a Node that prevents Pods from being scheduled there unless they declare a matching toleration.

Let's say we have a Kubernetes cluster with three nodes: node-1, node-2, and node-3. We want to ensure that node-1 is used exclusively for critical workloads and node-2 is reserved for GPU workloads. To achieve this, we can add taints to these nodes and specify tolerations in our Pod YAML file.
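
A sketch of the taints this assumes on the two nodes (the taint values are illustrative; the keys match the tolerations in the Pod manifest below):

kubectl taint nodes node-1 workload=critical:NoSchedule
kubectl taint nodes node-2 gpu=true:NoSchedule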

	
apiVersion: v1
kind: Pod
metadata:
  name: example-pod
spec:
  containers:
  - name: example-container
    image: nginx
  tolerations:
  - key: "workload"
    operator: "Equal"
    value: "critical"
    effect: "NoSchedule"
  - key: "gpu"
    operator: "Exists"
    effect: "NoSchedule"

  

In this example, the Pod has two tolerations. The first matches the taint on node-1, with a key of "workload", value of "critical", and effect of "NoSchedule". Without it, the scheduler would never place the Pod on node-1; with it, node-1 becomes a candidate alongside any untainted nodes.

The second toleration matches the taint on node-2, with a key of "gpu" and effect of "NoSchedule", making node-2 a candidate as well. Note that tolerations only permit scheduling onto tainted nodes; they do not attract the Pod to them. To actively steer a Pod towards a particular node, you still need node affinity or a node selector.

By using taints and tolerations, we can ensure that Pods are scheduled on nodes that meet their specific requirements and constraints, while also reserving certain nodes for specific workloads.

Pod Scheduling

Once the node selection process is completed, the pod scheduling process starts. Kubernetes uses the following scheduling mechanisms to place the pod on the selected node:

  1. Node Affinity: A way to constrain pod placement by specifying which nodes a pod can be scheduled on, based on labels that are applied to the nodes.
  2. Pod Affinity: A way to attract Pods towards nodes that are already running certain other Pods, for example to co-locate components that communicate frequently.
  3. Pod Anti-Affinity: A way to keep Pods away from nodes that are already running certain other Pods, for example to spread replicas across nodes.

Node Affinity

Node Affinity allows you to constrain which nodes your pod is eligible to be scheduled on, based on labels on the node. There are two types of node affinity:

  • RequiredDuringSchedulingIgnoredDuringExecution: The pod can only be scheduled on nodes that have the specified label(s). If no nodes have the specified label(s), the pod remains unscheduled.
  • PreferredDuringSchedulingIgnoredDuringExecution: The pod is scheduled on nodes that have the specified label(s) if possible. However, if no nodes have the specified label(s), the pod will still be scheduled.
	
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
      - name: frontend
        image: frontend-image
        resources:
          requests:
            cpu: "100m"
            memory: "128Mi"
          limits:
            cpu: "500m"
            memory: "256Mi"
        env:
        - name: FRONTEND_ENV_VAR
          value: "frontend"
      - name: backend
        image: backend-image
        resources:
          requests:
            cpu: "200m"
            memory: "256Mi"
          limits:
            cpu: "1"
            memory: "1Gi"
        env:
        - name: BACKEND_ENV_VAR
          value: "backend"
      - name: database
        image: database-image
        resources:
          requests:
            cpu: "100m"
            memory: "512Mi"
          limits:
            cpu: "500m"
            memory: "1Gi"
        env:
        - name: DATABASE_ENV_VAR
          value: "database"
      nodeSelector:
        disk: ssd
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: node
                operator: In
                values:
                - database
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - web
            topologyKey: "kubernetes.io/hostname"
	

In this example, we have added Node Affinity and Pod anti-affinity rules to the Deployment's pod template.

The requiredDuringSchedulingIgnoredDuringExecution node affinity rule specifies that the Pods may only be scheduled on nodes carrying the custom label node=database.

The podAntiAffinity rule keeps replicas of the web-app Pod apart: it matches Pods labelled app=web and uses the kubernetes.io/hostname topology key, so no two replicas are placed on the same node.

These rules ensure that the Pods are scheduled on nodes that meet their requirements while spreading replicas across nodes, optimising resource utilisation and improving the availability of the web application.

Pod Placement Strategies:

In Kubernetes, there are different strategies for placing Pods on nodes. The choice of strategy depends on the use case and the requirements of the application. Here are some of the Pod Placement Strategies supported by Kubernetes:

Spread-based Placement:

Spread-based placement ensures that Pods are evenly distributed across nodes to maximise availability and fault tolerance. This strategy is useful for applications that require high availability and cannot tolerate node failures. In this strategy, Kubernetes tries to place a new Pod on a node that has the fewest Pods of the same type (based on labels). This ensures that the Pods are spread across nodes as evenly as possible, reducing the impact of node failures.

	
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web-app
  template:
    metadata:
      labels:
        app: web-app
    spec:
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: "kubernetes.io/hostname"
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app: web-app
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: disk
                operator: In
                values:
                - ssd
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - web-app
            topologyKey: "kubernetes.io/hostname"
      containers:
      - name: frontend
        image: my-frontend-image
        resources:
          requests:
            cpu: "500m"
            memory: "512Mi"
          limits:
            cpu: "1000m"
            memory: "1024Mi"
      - name: backend
        image: my-backend-image
        resources:
          requests:
            cpu: "1000m"
            memory: "1024Mi"
          limits:
            cpu: "2000m"
            memory: "2048Mi"
      - name: database
        image: my-database-image
        resources:
          requests:
            cpu: "500m"
            memory: "1024Mi"
          limits:
            cpu: "1000m"
            memory: "2048Mi"

In this example, we have defined a Deployment for the web application with a replica count of 3 and a spread-based placement policy expressed through topologySpreadConstraints in the pod template. The constraint ensures that the Pods are spread evenly across nodes based on the topology key "kubernetes.io/hostname", with a maximum skew of 1; whenUnsatisfiable: DoNotSchedule tells the scheduler to treat the constraint as a hard requirement.

The maxSkew field specifies the maximum allowed difference between the number of Pods running on any two nodes. In this case, we have set it to 1, which means that no node can have more than one additional Pod running than any other node.

The topologyKey field specifies the node label that should be used to determine the topology of the cluster. In this case, we have set it to "kubernetes.io/hostname" to ensure that the Pods are spread across different nodes.

Bin-packing Placement:

Bin-packing placement optimises resource utilisation by packing Pods as densely as possible onto nodes. This strategy is useful for applications that prioritise high resource utilisation over spreading for availability. With bin-packing, the scheduler favours the nodes that are already the most utilised, filling them up before spilling onto emptier ones; this is the opposite of the default scoring, which prefers the least-allocated nodes.
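
Bin-packing is not something a workload manifest selects on its own; it is configured in the scheduler's scoring. Below is a minimal sketch of a scheduler profile that scores nodes by how allocated they already are (the profile name bin-packing-scheduler is our own choice, and this assumes you can supply a KubeSchedulerConfiguration to kube-scheduler):

apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
- schedulerName: bin-packing-scheduler
  pluginConfig:
  - name: NodeResourcesFit
    args:
      scoringStrategy:
        # MostAllocated favours nodes that are already heavily utilised (bin-packing);
        # the default, LeastAllocated, does the opposite and spreads load out.
        type: MostAllocated
        resources:
        - name: cpu
          weight: 1
        - name: memory
          weight: 1

Pods would opt into this profile by setting schedulerName: bin-packing-scheduler in their spec. The Deployment below focuses on the workload side: resource requests, limits, and anti-affinity.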

	
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web-app
  template:
    metadata:
      labels:
        app: web-app
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - web-app
            topologyKey: "kubernetes.io/hostname"
      containers:
      - name: frontend
        image: my-frontend-image
        resources:
          requests:
            cpu: "500m"
            memory: "256Mi"
          limits:
            cpu: "1"
            memory: "512Mi"
        env:
        - name: ENVIRONMENT
          value: "production"
      - name: backend
        image: my-backend-image
        resources:
          requests:
            cpu: "500m"
            memory: "256Mi"
          limits:
            cpu: "1"
            memory: "512Mi"
        env:
        - name: ENVIRONMENT
          value: "production"
      - name: database
        image: my-database-image
        resources:
          requests:
            cpu: "500m"
            memory: "512Mi"
          limits:
            cpu: "1"
            memory: "1Gi"
        env:
        - name: ENVIRONMENT
          value: "production"  
	

In this example, we define a Deployment for a web application whose pod template runs three containers: frontend, backend, and database. The manifest itself does not select a packing strategy; density comes from the scheduler's scoring configuration, while the manifest declares accurate resource requests and limits so the scheduler can pack nodes effectively.

The frontend and backend containers have specific resource requests and limits for CPU and memory, while the database container has a higher memory requirement. All three containers have an environment variable set to "production".

The Deployment also uses pod anti-affinity to ensure that Pods are not scheduled on nodes that already have a Pod of the same app. The topology key used for the anti-affinity rule is "kubernetes.io/hostname", which means that Pods are spread across different nodes as much as possible.

Overall, this example demonstrates how bin-packing placement can be combined with sensible requests and limits to optimise resource utilisation for a web application deployed on a Kubernetes cluster.

Custom Placement Strategies:

Kubernetes also allows users to define their own Pod Placement strategies using Kubernetes APIs or third-party tools. With custom placement strategies, users can define placement rules based on specific requirements, such as regulatory compliance, network locality, or hardware affinity.

To define a custom placement strategy, users can combine the built-in primitives (priority classes, node selectors, affinities, and tolerations) to encode placement rules based on node labels and other node metadata. For requirements the built-ins cannot express, Kubernetes also supports running a custom or secondary scheduler, extending kube-scheduler through scheduling framework plugins, or using a scheduler extender.

	
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 1000000
globalDefault: false
description: "This priority class should be used for high priority Pods only."
---
apiVersion: v1
kind: Pod
metadata:
  name: custom-placement
spec:
  containers:
  - name: custom-container
    image: nginx
  priorityClassName: high-priority
  nodeSelector:
    disk: ssd
  affinity:
    podAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - podAffinityTerm:
          labelSelector:
            matchExpressions:
            - key: app
              operator: In
              values:
              - database
          topologyKey: kubernetes.io/hostname
        weight: 100
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: app
            operator: In
            values:
            - frontend
        topologyKey: kubernetes.io/hostname
  tolerations:
  - key: "gpu"
    operator: "Exists"
    effect: "NoSchedule"
  - key: "privileged"
    operator: "Exists"
    effect: "NoSchedule"
	

In this example, we define a custom placement strategy for a Pod that requires high priority and needs to be scheduled only on nodes with SSD disks. We also define podAffinity and podAntiAffinity rules that ensure the Pod is scheduled on nodes that match certain labels and topology keys, and tolerations that allow it to be scheduled on nodes with certain constraints.

Note that the priorityClassName field is defined in a separate PriorityClass resource and is referenced in the Pod definition. This allows for the definition of different priority levels for Pods in the same cluster.

Best practices for using pod placement effectively

The following best practices will help ensure effective Pod placement, efficient resource utilisation, and high availability of applications in Kubernetes.

  1. Use meaningful labels: When defining labels for your Kubernetes objects, use meaningful and consistent labels that reflect their purpose and role in your application. This will make it easier to search and filter objects, and to understand their relationships with other objects in your cluster.
  2. Use selectors to group related objects: Use selectors to group related objects, such as Pods and Services, into logical units. This can help you manage and scale your application more easily, and can make it easier to perform rolling updates and other operations.
  3. Use labels to identify specific objects: Use labels to identify specific objects, such as Pods and Services, and to associate them with other objects in your cluster. This can be particularly useful for load balancing and service discovery, as well as for monitoring and debugging.
  4. Avoid using too many labels: While labels can be useful for organising and managing Kubernetes objects, it's important not to overuse them. Using too many labels can make it harder to manage and scale your application, and can increase the risk of errors and inconsistencies.
  5. Use label selectors for efficient querying: When querying Kubernetes objects using label selectors, use efficient and specific queries to minimise the load on your cluster (see the example after this list). This can help you avoid performance issues and ensure that your application runs smoothly.
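
As a small illustration of specific selector queries, kubectl accepts both equality-based and set-based selectors (the label keys and values here match the labels used in the earlier manifests):

# Equality-based selector: all Pods belonging to the web application
kubectl get pods -l app=web

# Set-based selector: only the frontend and backend tiers
kubectl get pods -l 'app=web,tier in (frontend,backend)'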

By following these best practices for using labels and selectors in Kubernetes, you can ensure that your application is well-organized, easy to manage, and performs reliably.

Conclusion

Pod placement is a critical component of Kubernetes that enables users to optimise their applications' resource utilisation and performance. With various placement strategies such as spread-based and bin-packing placement and custom placement strategies, Kubernetes provides users with the flexibility and control to place Pods on nodes based on their specific requirements.

Looking to elevate your Kubernetes strategy and implementation? Talk to us!