Assigning Node Metadata to Pods

How to assign Node metadata like labels and annotations to Pods.

If you’re running Kubernetes in production, especially in a public cloud where a single cluster may span multiple availability zones, chances are you’re configuring workloads with some awareness of your topology. Kubernetes has a few mechanisms to support zone awareness, but one common need is propagating certain Node metadata, such as labels or annotations, down to Pods to assist with this awareness. In this blog, we’ll go into the specifics of how Pod scheduling really works and share some tips on how Kyverno can mutate Pods to add Node metadata like labels. Even if you’re not a Kyverno user, you’ll most likely learn something you didn’t know about Kubernetes.

When deploying Pod controllers with multiple replicas to a single cluster whose Nodes sit in different geographical areas, high-level groupings like “zones” or “availability zones” are important failure domains. You typically want to spread those replicas across those zones so the failure of any one zone does not take out the application. Kubernetes already has several methods for working with topologies like zones, including topology spread constraints, with some very recent enhancements in beta. The label topology.kubernetes.io/zone is the common Kubernetes label used to denote zone information on Nodes. It can be set automatically when using a cloud provider integration, or manually, for example when running on-premises and building your own concept of zones.
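
If you’re applying the zone label yourself, for instance on-premises, it’s a single kubectl command per Node (the Node name and zone value below are just placeholders):

$ kubectl label node mynode01 topology.kubernetes.io/zone=zone-a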

Having the Node’s zone information available to the Pods running on it can be important, for example so Pods only respond to requests from that same zone and avoid making cross-zone calls, which can result in potentially expensive egress charges. Zone information aside, there is a whole host of other Node information a Pod might want to receive, such as governance labels you apply, custom annotations, and so on. The challenge is how to get this information, because you don’t know on which Node a Pod will be scheduled.

Mutation of Pods is no great feat and Kyverno was one of the first admission controllers to offer it. Making calls to the Kubernetes API server is also well known and Kyverno does that too. But, as those who have tried this before and failed can attest, you don’t know what to ask for when you make that call. This is because the Kubernetes scheduler is the one that decides which Node will receive the Pod. And understanding how and where the scheduler fits into the Pod lifecycle is both nuanced and not well documented, even in the official docs. So, without further ado, we’d like to present the complete(r) guide to how Pod scheduling actually works, because this is critical to understanding both why this is problematic and how you can do it with Kyverno. Pay attention to the step numbers on the sequence diagram as we’ll reference them in the discussion that follows.

Pod Scheduling in Kubernetes

Thanks to the magic of Mermaid for enabling a diagram like this to be written in pure Markdown!

sequenceDiagram
    autonumber
    User->>API Server: Create Pod "mypod"
    API Server->>etcd: Persist Pod
    Note over API Server,etcd: CREATE operation for Pod
    Note right of etcd: Status is "Pending"
    activate Scheduler
    Scheduler-->>etcd: See Pod needs scheduling
    Scheduler-->>Scheduler: Determine optimal node
    Scheduler->>API Server: Bind "mypod" to node "mynode01"
    Note over Scheduler,API Server: CREATE operation for Pod/binding
    deactivate Scheduler
    API Server->>etcd: Set nodeName in "mypod" to "mynode01"
    activate Kubelet on mynode01
    Kubelet on mynode01-->>etcd: See new Pod to run
    Kubelet on mynode01->>Kubelet on mynode01: Run "mypod"
    deactivate Kubelet on mynode01
    Note right of Kubelet on mynode01: Status is "Running"

Ok, with the sequence diagram above in hand, let’s break it down. Refer to the step numbers for the matching description.

  1. A user sends a request to create a Pod called “mypod”. The assumption here is they are not statically defining which Node should be used via spec.nodeName.
  2. The API server receives this request and performs a CREATE operation on a Pod. This gets persisted to etcd. Importantly, when this happens, the Node is NOT yet known since the scheduler hasn’t kicked in. When it creates this Pod in etcd, the status is listed as “Pending”.
  3. Now the scheduler enters the picture. It sees there is a Pod which has been persisted but hasn’t been scheduled so it leaps into action.
  4. Using its (very sophisticated) internal logic, it determines the optimal Node. In this case, it selected “mynode01”.
  5. Once the target Node has been selected, it performs the binding. It does this by performing another CREATE, but this time for a subresource called Pod/binding. We’ll talk more about this soon. The binding is nothing more than an instruction to the API server which says “bind mypod to Node mynode01” (a minimal sketch of such a Binding appears just after this list).
  6. The API server receives this Pod/binding and reacts by setting the spec.nodeName field of mypod (which was persisted in step 2) to mynode01. Very importantly, when it does this it DOES NOT result in any sort of UPDATE operation on the Pod. It happens more or less “transparently”.
  7. Kubelet running on mynode01 is looking solely for Pods which have been assigned to it. It sees a new Pod pop up on its radar.
  8. Kubelet then runs the Pod by pulling the container images and starting everything up. Only now does the Pod’s status enter the “Running” phase.

Note that in the above description and diagram, the CREATE and UPDATE operations refer to the context where admission controllers, like Kyverno, participate and not the actual verbs from an RBAC perspective.
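
To make step 5 more concrete, the scheduler’s request boils down to a tiny core/v1 Binding object naming the Pod and its target Node. A minimal sketch, reusing the names from the diagram (the scheduler builds this programmatically rather than from YAML):

apiVersion: v1
kind: Binding
metadata:
  name: mypod
  namespace: default
target:
  kind: Node
  name: mynode01

In fact, creating a Binding like this by hand against a Pending Pod is how you can assign a Pod to a Node yourself without involving the scheduler at all.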

Pod Bindings

If you refer back to step 5 you’ll notice the binding event. In this step, the scheduler informs the API server which Node should run the Pod. It does this via what’s called a subresource of the Pod. A subresource is essentially a way to “carve out” sections of a parent resource so they can be addressed, and access-controlled via RBAC, separately. For example, /status and /exec are other well-known Pod subresources; they control specific areas of a Pod, namely the .status object and Pod exec operations, respectively. A /binding is a similar, albeit lesser-known, subresource that is normally only sent by the scheduler.
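
Because subresources are addressed separately in RBAC, you can grant access to one without granting access to the parent resource or its siblings. Here’s a short illustrative Role (the name is made up) that permits reading Pod status and creating bindings, but nothing else on Pods:

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: subresource-demo
  namespace: default
rules:
# read-only access to the /status subresource
- apiGroups: [""]
  resources: ["pods/status"]
  verbs: ["get"]
# create access to the /binding subresource, the one the scheduler writes
- apiGroups: [""]
  resources: ["pods/binding"]
  verbs: ["create"]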

Here’s an example binding, as delivered to Kyverno in an admission request, taken from a K3d (K3s) cluster. Notice that the Pod definition itself is absent. The meat of this subresource is under object.target, which is where the scheduler defines the kind of resource (Node in this case) and the name of that resource (k3d-kyv1111-server-0 as shown here).

uid: 21fb3d8e-b9c9-42fe-a987-d4374e74a084
kind:
  group: ""
  version: v1
  kind: Binding
resource:
  group: ""
  version: v1
  resource: pods
subResource: binding
requestKind:
  group: ""
  version: v1
  kind: Binding
requestResource:
  group: ""
  version: v1
  resource: pods
requestSubResource: binding
name: busybox
namespace: default
operation: CREATE
userInfo:
  username: system:kube-scheduler
  groups:
    - system:authenticated
roles:
  - kube-system:extension-apiserver-authentication-reader
  - kube-system:system::leader-locking-kube-scheduler
clusterRoles:
  - system:basic-user
  - system:discovery
  - system:kube-scheduler
  - system:public-info-viewer
  - system:volume-scheduler
object:
  apiVersion: v1
  kind: Binding
  metadata:
    creationTimestamp: null
    managedFields:
      - apiVersion: v1
        fieldsType: FieldsV1
        fieldsV1:
          f:target: {}
        manager: k3s
        operation: Update
        subresource: binding
        time: "2024-02-18T14:36:37Z"
    name: busybox
    namespace: default
    uid: fceccee4-4821-408a-b75b-44262392b93c
  target:
    kind: Node
    name: k3d-kyv1111-server-0
oldObject: null
dryRun: false
options:
  apiVersion: meta.k8s.io/v1
  kind: CreateOptions

Fortunately, these Pod/binding subresources can be sent to admission controllers like Kyverno. So now that the Pod has been persisted, we can observe the binding, which contains the Node name, and then make a call to the API server to get information on that Node. Grand!
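
That lookup is an ordinary GET against the Node endpoint, the same URL path the policies below use. You can try it yourself with the Node name from the binding above (jq is optional, used here only for readability):

$ kubectl get --raw /api/v1/nodes/k3d-kyv1111-server-0 | jq '.metadata.labels'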

Not so fast. Before you click close on this blog, keep reading. It’s a little more complex.

Using Node Info in Pods

We know what we want and we know approximately how to get it now that bindings have been introduced. But how and where you use the Node information in a Pod matters when it comes to supplying it.

Typically, use cases which involve fetching Node information and presenting it to Pods require that the container(s) in the Pod know about it. Informing containers in Pods of Pod metadata is usually done using the downward API. In Kubernetes, the downward API is used to present information about the Pod and/or its surroundings to the containers within it. It can do this in two primary ways: environment variables and volumes. Using either, the containers now understand their environment without them having to go and figure it out on their own. This is a boon because it’s simpler for the app and it’s more secure. It’s also the only way, in many cases, to provide this sort of orientation to containers.

For example, you can use the downward API to tell a Pod’s containers about the name of the Pod in which they are running with an environment variable named POD_NAME.

  env:
    - name: POD_NAME
      valueFrom:
        fieldRef:
          fieldPath: metadata.name

And you could do the same thing with a volume and its mount.

  volumes:
    - name: podname
      downwardAPI:
        items:
          - path: podname
            fieldRef:
              fieldPath: metadata.name

    volumeMounts:
      - name: podname
        mountPath: /etc/podinfo

In the case of the former, the containers will get an environment variable POD_NAME=mypod; in the case of the latter, they will have a file available at /etc/podinfo/podname containing the value mypod.

But here’s another important nuance which will be highlighted in the next section: which way you choose to consume this information can matter when fetching it from Nodes.

Once Kubelet launches the containers, providing this information later does nothing; it won’t be available because the containers have already started. Environment variables, in particular, need to be defined before Kubelet begins its routine. Volumes are a bit more forgiving, especially for lengthier container image pulls, since it takes the Kubelet a little longer to establish everything before the containers actually start. Keep this in mind because it affects whether or not you’ll be successful in mutating those Pods.

Mutations

Now comes the part you’ve been waiting for. We had to tee this section up in order for you to get the full picture. Let’s talk about finally performing the mutations. Make sure you’ve fully read and understood the previous sections or this will be confusing.

When assigning Node metadata to Pods in Kyverno, you have a couple different options. What exactly you want to use that Node information for in the Pod determines which method you’ll need.

Mutating Bindings

The first and arguably “best” option for slapping Node info on Pods is to mutate the Pod/binding subresource directly. Recall from the scheduling discussion above that these Pod/binding subresources are sent to admission controllers. Also recall from the example binding that it does NOT contain the full Pod representation; it’s really an entirely different thing. There is only one type of mutation you can perform on a binding that matters in this context, and that is writing annotations.

Kubernetes has a “secret” function called setPodHostAndAnnotations() which takes any annotations on a binding and transfers them to the parent Pod resource. This only works for annotations, which is why, if you’ve tried to set labels this way, you’ve found they just get dropped. If you are going to consume Node information via the downward API, either as an environment variable or a volume, this is the recommended way of going about it, since it’s the most reliable and the annotation is guaranteed to be written before the Pod is started.
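
To illustrate the effect, if your policy writes a zone annotation onto the binding, the Pod that Kubelet eventually sees ends up looking roughly like this (the values are illustrative):

apiVersion: v1
kind: Pod
metadata:
  name: mypod
  annotations:
    # carried over from the mutated Pod/binding
    topology.kubernetes.io/zone: us-east-2a
spec:
  # also set as a result of the binding
  nodeName: mynode01
  containers:
  - name: app
    image: busybox:latest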

The sample policy is available here and is shown below, updated with modifications to show assignment of the topology.kubernetes.io/zone annotation. This works as of Kyverno 1.10.

apiVersion: kyverno.io/v2beta1
kind: ClusterPolicy
metadata:
  name: mutate-pod-binding
  annotations:
    pod-policies.kyverno.io/autogen-controllers: none
    policies.kyverno.io/title: Mutate Pod Binding
    policies.kyverno.io/category: Other
    policies.kyverno.io/subject: Pod
    kyverno.io/kyverno-version: 1.10.0
    policies.kyverno.io/minversion: 1.10.0
    kyverno.io/kubernetes-version: "1.26"
spec:
  background: false
  rules:
    - name: project-foo
      match:
        any:
        - resources:
            kinds:
            - Pod/binding
      context:
      - name: node
        variable:
          jmesPath: request.object.target.name
          default: ''
      - name: zone
        apiCall:
          urlPath: "/api/v1/nodes/{{node}}"
          jmesPath: "metadata.labels.\"topology.kubernetes.io/zone\" || 'empty'"
      mutate:
        patchStrategicMerge:
          metadata:
            annotations:
              topology.kubernetes.io/zone: "{{ zone }}"

Demo

Let’s demonstrate this in action. For this, we’ll use a simple K3d cluster with 3 nodes, where one is the control plane and the other two are worker nodes. Each of the two workers is in a separate availability zone (abbreviated output of labels below).

$ kubectl get no --show-labels
NAME                 STATUS   ROLES                  AGE     VERSION        LABELS
k3d-kyv11-server-0   Ready    control-plane,master   3h27m   v1.27.4+k3s1
k3d-worker01-0       Ready    <none>                 13m     v1.27.4+k3s1   topology.kubernetes.io/zone=us-east-2a
k3d-worker02-0       Ready    <none>                 13m     v1.27.4+k3s1   topology.kubernetes.io/zone=us-east-2b

Remove the resource filters as described previously and install the ClusterPolicy.
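
If you haven’t done that yet: the filters live in Kyverno’s configuration ConfigMap, typically named kyverno in the kyverno namespace for a default installation (names can vary by install method). A rough sketch of the change:

$ kubectl -n kyverno edit configmap kyverno
# in the resourceFilters entry, remove the items that exclude bindings,
# e.g. [Binding,*,*] and [Pod/binding,*,*], so Kyverno receives those requests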

Let’s now create a Deployment of busybox with 4 replicas. We’ll use some nodeAffinity here to ensure all four Pods land on one of the two worker nodes.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: busybox
  labels:
    app: busybox
spec:
  replicas: 4
  selector:
    matchLabels:
      app: busybox
  template:
    metadata:
      labels:
        app: busybox
    spec:
      automountServiceAccountToken: false
      containers:
      - image: busybox:latest
        name: busybox
        command:
        - env
        env:
        - name: ZONE
          valueFrom:
            fieldRef:
              fieldPath: metadata.annotations['topology.kubernetes.io/zone']
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: topology.kubernetes.io/zone
                operator: Exists

Once this is created, we expect Kyverno to mutate the Pod/binding resource and thereby add the topology.kubernetes.io/zone annotation to each Pod (not to the parent Deployment, since we disabled rule auto-generation). We’re running the env program to simply print all the environment variables, so we expect the Pods to end up in a “Completed” state where we can inspect the logs and, hopefully, see the ZONE environment variable set to the value of the topology.kubernetes.io/zone label on the Node where each Pod landed.

Create the Deployment and let’s check the Pods. You can see that two were scheduled on each worker.

$ kubectl get po -o wide
NAME                       READY   STATUS      RESTARTS     AGE    IP           NODE                 NOMINATED NODE   READINESS GATES
busybox-6b869bf945-5hcn6   0/1     Completed   1 (3s ago)   4s     10.42.2.6    k3d-worker02-0       <none>           <none>
busybox-6b869bf945-bcbxv   0/1     Completed   1 (3s ago)   4s     10.42.2.5    k3d-worker02-0       <none>           <none>
busybox-6b869bf945-nb8sc   0/1     Completed   0            4s     10.42.1.4    k3d-worker01-0       <none>           <none>
busybox-6b869bf945-zvg2m   0/1     Completed   0            4s     10.42.1.5    k3d-worker01-0       <none>           <none>

Get the logs for one of the Pods on worker01.

$ kubectl logs busybox-6b869bf945-nb8sc
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
HOSTNAME=busybox-6b869bf945-nb8sc
ZONE=us-east-2a
KUBERNETES_SERVICE_PORT=443
KUBERNETES_SERVICE_PORT_HTTPS=443
KUBERNETES_PORT=tcp://10.43.0.1:443
KUBERNETES_PORT_443_TCP=tcp://10.43.0.1:443
KUBERNETES_PORT_443_TCP_PROTO=tcp
KUBERNETES_PORT_443_TCP_PORT=443
KUBERNETES_PORT_443_TCP_ADDR=10.43.0.1
KUBERNETES_SERVICE_HOST=10.43.0.1
HOME=/root

And again for one on worker02.

$ kubectl logs busybox-6b869bf945-5hcn6
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
HOSTNAME=busybox-6b869bf945-5hcn6
ZONE=us-east-2b
KUBERNETES_PORT=tcp://10.43.0.1:443
KUBERNETES_PORT_443_TCP=tcp://10.43.0.1:443
KUBERNETES_PORT_443_TCP_PROTO=tcp
KUBERNETES_PORT_443_TCP_PORT=443
KUBERNETES_PORT_443_TCP_ADDR=10.43.0.1
KUBERNETES_SERVICE_HOST=10.43.0.1
KUBERNETES_SERVICE_PORT=443
KUBERNETES_SERVICE_PORT_HTTPS=443
HOME=/root

As you can clearly see, our mutation worked: the Node’s label was fetched, written as an annotation on the Pod/binding subresource, merged into the Pod’s annotations, and projected as an environment variable into the busybox containers.
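
If you’d like to confirm the annotation really landed on the Pod object itself (and not just in the container’s environment), a quick jsonpath query against one of the Pods above should print its zone, us-east-2a in this case:

$ kubectl get pod busybox-6b869bf945-nb8sc \
    -o jsonpath='{.metadata.annotations.topology\.kubernetes\.io/zone}'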

Mutating Pods

The second method for reflecting Node information on Pods is to mutate the Pod directly. If, for example, you absolutely must write labels to the Pod, then you cannot mutate the Pod/binding resource as explained in the previous section; you must mutate the Pod itself. The same goes for any other type of mutation to the Pod.

The kicker here is that this style of mutation, what’s known as a “mutate existing” rule, is an asynchronous process. Why is this important? Since it’s async, there are no guarantees as to when the mutation will be performed. It could happen within 50ms of receiving the binding, or it could take a second or more depending on load and a variety of other factors. And, chances are, that mutation could happen after the Pod has already started. Once the Pod has started, any environment variables or volumes have already been set and cannot be influenced. So this method should really only be used if you don’t care about consuming the Node information in the Pod’s containers directly. For example, if you just want the Pod to carry labels from its parent Node for purposes of reporting or costing, this method is fine.

An example of how to perform a mutation on an existing Pod in response to a binding is shown below. But first, you’ll probably need to grant the Kyverno background controller (the one responsible for all generate and “mutate existing” rules) some permissions it doesn’t have by default, since we’re now changing existing Pods.

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  labels:
    app.kubernetes.io/component: background-controller
    app.kubernetes.io/instance: kyverno
    app.kubernetes.io/part-of: kyverno
  name: kyverno:update-pods
rules:
- apiGroups:
  - ""
  resources:
  - pods
  verbs:
  - update

apiVersion: kyverno.io/v2beta1
kind: ClusterPolicy
metadata:
  name: add-node-labels-pod
  annotations:
    pod-policies.kyverno.io/autogen-controllers: none
spec:
  rules:
    - name: add-topology-labels
      match:
        any:
        - resources:
            kinds:
            - Pod/binding
      context:
      - name: node
        variable:
          jmesPath: request.object.target.name
          default: ''
      - name: ZoneLabel
        apiCall:
          urlPath: "/api/v1/nodes/{{node}}"
          jmesPath: "metadata.labels.\"topology.kubernetes.io/zone\" || 'empty'"
      mutate:
        targets:
        - apiVersion: v1
          kind: Pod
          name: "{{ request.object.metadata.name }}"
          namespace: "{{ request.object.metadata.namespace }}"
        patchStrategicMerge:
          metadata:
            labels:
              # https://kubernetes.io/docs/reference/labels-annotations-taints/#topologykubernetesiozone
              topology.kubernetes.io/zone: "{{ ZoneLabel }}"
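
Once this policy is in place and Pods are created (or recreated), a quick way to check the result is to ask kubectl to print the label as a column; just remember the label may appear a moment after the Pod starts, since the mutation is asynchronous:

$ kubectl get po -L topology.kubernetes.io/zone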

Closing

This was a lot of information, but hopefully you learned something, even if you aren’t a Kyverno user. To sum this blog post up: if you want to consume Node info in a Pod at runtime via environment variables or volumes, you should be mutating the Pod/binding subresource, and only annotations may be added to a Pod/binding directly. For all other use cases you should be mutating the existing Pod. The two approaches can also be combined if need be. Example policies for both are provided, but before selecting one you should understand what they do and the result you can expect.