
Kubernetes Services clusterIP and externalIPs with IPTables

A basic ClusterIP type service might have the following elements to its description:

me@myhost:~$ kubectl get service my-service -o json
{
    "apiVersion": "v1",
    "kind": "Service",
    ...
    "spec": {
        "clusterIP": "10.41.0.123",
        "ports": [
            {
                "name": "my-service",
                "port": 6556,
                "protocol": "TCP",
                "targetPort": 6556
            }
        ],
        ...
        "type": "ClusterIP"
    },
    ...
}
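
As an aside, the clusterIP alone can be pulled straight out with kubectl's jsonpath output format:

me@myhost:~$ kubectl get service my-service -o jsonpath='{.spec.clusterIP}'
10.41.0.123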

How does this clusterIP actually work with iptables?

Here’s the secret: the clusterIP (in this case 10.41.0.123) does not belong to any interface! It is a fiction, an illusion.
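
You can see this for yourself on a worker node (assuming kube-proxy is running in iptables mode, as in this article): the address does not appear on any interface.

me@myhost:~$ ip -o addr | grep -F 10.41.0.123
me@myhost:~$

No output: there is simply nothing to find.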

Let’s say we have 3 worker nodes (a node being a server, real or virtual, that hosts pods) and each worker node has a pod for this service running on it. That might look a little like the following:

[Figure: nodes and pods for the service]

Nowhere on this picture is the clusterIP (10.41.0.123).

So how does a packet destined for 10.41.0.123 end up at a pod for this service?

The answer is that every single node has a set of iptables NAT table rules that intercept packets destined for the clusterIP and re-write the destination address to that of a pod assigned to the service.

Let’s begin by inspecting the NAT (network address translation) table of iptables on a Kubernetes worker node (only those parts that are of interest will be shown here):

me@myhost:~$ sudo iptables -L -v -n -t nat |less
Chain PREROUTING (policy ACCEPT 19 packets, 1524 bytes)
 pkts bytes target     prot opt in     out     source               destination 
 377M   30G KUBE-SERVICES  all  --  *      *       0.0.0.0/0            0.0.0.0/0            /* kubernetes service portals */

Chain OUTPUT (policy ACCEPT 4 packets, 285 bytes)
 pkts bytes target     prot opt in     out     source               destination
  30M 1834M KUBE-SERVICES  all  --  *      *       0.0.0.0/0            0.0.0.0/0            /* kubernetes service portals */

Chain KUBE-SERVICES (2 references)
 pkts bytes target     prot opt in     out     source               destination 
    0     0 KUBE-SVC-W3DXGYHAQT4XZCGH  tcp  --  *      *       0.0.0.0/0            10.41.0.123          /* my-service: cluster IP */ tcp dpt:6556

Let’s explain what this does. When a packet first arrives at a node on any interface it enters the PREROUTING chain. If a packet originates from the node itself it enters the OUTPUT chain instead. In both cases the packet has not yet been processed by the routing table.

Regardless of whether the packet enters the PREROUTING or OUTPUT chain, it is sent to the KUBE-SERVICES chain, which looks out for service IPs (and ports).
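
To find the rules for a particular service without paging through the whole table, you can list that chain alone and filter on the service name:

me@myhost:~$ sudo iptables -t nat -L KUBE-SERVICES -n | grep my-service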

In the case of a TCP packet destined to IP address 10.41.0.123 and port 6556 it then gets sent to the KUBE-SVC-W3DXGYHAQT4XZCGH chain.

What does this KUBE-SVC-W3DXGYHAQT4XZCGH chain look like?

Chain KUBE-SVC-W3DXGYHAQT4XZCGH (1 references)
 pkts bytes target     prot opt in     out     source               destination 
    0     0 KUBE-SEP-TYCQ62MBFETG3WXG  all  --  *      *       0.0.0.0/0            0.0.0.0/0            /* my-service:my-service */ statistic mode random probability 0.33332999982
    0     0 KUBE-SEP-GWC5HBFAKQJPM7YT  all  --  *      *       0.0.0.0/0            0.0.0.0/0            /* my-service:my-service */ statistic mode random probability 0.50000000000
    0     0 KUBE-SEP-PM3FH7DWZGHX3JK2  all  --  *      *       0.0.0.0/0            0.0.0.0/0            /* my-service:my-service */

This chain distributes incoming packets to different pods at random. Each rule matches with probability 1/n, where n is the number of pods remaining at that rule (including the current one). So if there are 4 pods the first rule matches with probability 1/4, the second with 1/3, the third with 1/2, and the last rule always matches. The net effect is an even split: for example, the second rule is only reached 3/4 of the time, and 3/4 × 1/3 = 1/4, so every pod receives the same overall share of new connections.
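
As an illustration only (kube-proxy writes these rules itself; the chain and endpoint names below are made up), here is a small shell sketch that prints rules with the same probability pattern for n endpoints:

n=3    # number of pod endpoints
for i in $(seq 1 $((n - 1))); do
    # each rule matches with probability 1/(endpoints remaining at this rule)
    p=$(awk -v i="$i" -v n="$n" 'BEGIN { printf "%.11f", 1 / (n - i + 1) }')
    echo iptables -t nat -A KUBE-SVC-EXAMPLE \
        -m statistic --mode random --probability "$p" \
        -j "KUBE-SEP-EXAMPLE-$i"
done
# the final rule has no statistic match and is always taken
echo iptables -t nat -A KUBE-SVC-EXAMPLE -j "KUBE-SEP-EXAMPLE-$n"

For n=3 this prints probabilities 0.33333333333 and 0.50000000000; the 0.33332999982 in the real output above is simply a less precise representation of 1/3.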

Let’s now take a closer look at the KUBE-SEP-TYCQ62MBFETG3WXG chain which does the actual DNAT (destination address re-writing) to a particular pod:

Chain KUBE-SEP-TYCQ62MBFETG3WXG (1 references)
 pkts bytes target     prot opt in     out     source               destination
    0     0 DNAT       tcp  --  *      *       0.0.0.0/0            0.0.0.0/0            /* my-service:my-service */ tcp to:10.22.1.101:6556

What this rule does is re-write the destination address of any packet sent to this chain to the IP address of the pod (in this case my-service-a36d22d87-ge3pk, or 10.22.1.101, from the picture above) and the target port of the service.

Once the packet has a different destination (the IP address of the pod) it can be routed normally to the pod.

This is how a clusterIP address doesn’t really exist anywhere – yet any packet destined for the clusterIP address that arrives on a worker node will still be delivered to an appropriate pod.
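
A quick way to convince yourself from any worker node, assuming for the sake of the example that the pods behind this service speak HTTP:

me@myhost:~$ curl --max-time 5 http://10.41.0.123:6556/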

[Figure: a packet for the clusterIP has its destination changed to a pod IP]

ExternalIPs

So why have externalIPs?

A clusterIP, in theory, is known only to the Kubernetes cluster. It should be an address local to the cluster and, quite possibly, automatically assigned. That is fine for one pod (say, a web server) in a Kubernetes cluster talking to a different service (say, a database) on the same Kubernetes cluster.

You may wish, however, to expose a manually assigned address so that an external (non-Kubernetes) router will forward packets for that address to a node in the cluster.

The mechanics for externalIPs are almost exactly the same as for a clusterIP: iptables NAT table rules are added and, in fact, the KUBE-SERVICES rules that match the externalIPs simply send packets to exactly the same SVC chain as the clusterIP rules for that same service.
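
On a node this shows up as an extra rule in the KUBE-SERVICES chain matching the external address. A sketch of what it might look like (the exact match conditions vary between kube-proxy versions):

Chain KUBE-SERVICES (2 references)
 pkts bytes target     prot opt in     out     source               destination 
    0     0 KUBE-SVC-W3DXGYHAQT4XZCGH  tcp  --  *      *       0.0.0.0/0            192.0.2.16           /* my-service: external IP */ tcp dpt:6556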

A service definition looks very similar with externalIPs:

me@myhost:~$ kubectl get service my-service -o json
{
    "apiVersion": "v1",
    "kind": "Service",
    ...
    "spec": {
        "clusterIP": "10.41.0.123",
        "externalIPs": [
            "192.0.2.16"
        ],
        ...
        "type": "ClusterIP"
    },
    ...
}
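
One way to add an externalIP to an existing service is with a merge patch (192.0.2.16 here is a documentation address standing in for one your external router actually forwards):

me@myhost:~$ kubectl patch service my-service -p '{"spec":{"externalIPs":["192.0.2.16"]}}'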

Conntrack

So you’ve made a TCP connection to a clusterIP which was DNAT’d (destination network address translated). It was also masqueraded (meaning that when the packet was forwarded the source address was re-written to the IP address of the node). The masquerade rule is there in the iptables NAT table (finding it is left as an exercise for the reader; hint: check the POSTROUTING and KUBE-POSTROUTING chains).
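
Some help with that exercise (the commands to look with, not the answer):

me@myhost:~$ sudo iptables -t nat -L POSTROUTING -n -v
me@myhost:~$ sudo iptables -t nat -L KUBE-POSTROUTING -n -v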

The pod therefore receives a packet addressed to itself, apparently from the forwarding node; so how does the pod’s reply make its way back to the original sender?

The answer is conntrack.

Let’s say another pod (from a different service) on Node A with IP address 10.22.1.87 makes a TCP connection to the clusterIP 10.41.0.123 and port 6556 and that gets DNAT’d to pod my-service-a36d22d87-e2kja with IP address 10.22.1.122 on node B. Note that the forwarded packet may be masqueraded with a source IP of Node A’s IP address (172.33.22.14) instead of the originating pod’s IP address (10.22.1.87). We might get a conntrack entry like the following:

me@myhost:~$ sudo cat /proc/net/nf_conntrack # or conntrack -L -n
tcp      6 113 ESTABLISHED src=10.22.1.87 dst=10.41.0.123 sport=58366 dport=6556 src=10.22.1.122 dst=10.22.1.87 sport=6556 dport=58366 [ASSURED] mark=0 secctx=system_u:object_r:unlabeled_t:s0 use=1

This conntrack entry stays alive for a short period of time (usually less than a minute while inactive) and knows how to re-map the source address/port (and destination address/port if necessary) of any replies from the host to which the original packet was DNAT’d (and masqueraded).
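
If the conntrack tool is installed you can filter for just this service's connections by their original (pre-DNAT) destination:

me@myhost:~$ sudo conntrack -L -p tcp --orig-dst 10.41.0.123 --orig-port-dst 6556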

How to Get a List of All Pods for a Given Service in Kubernetes

Let’s say you have a particular service name and you want to know the names of all the pods for that service.

Start by dumping the YAML configuration of the service to find the “selector”:

me@myhost:~ $ kubectl get service my-service-my -o yaml
...
spec:
  ...
  selector:
    ...
    app: my-service
...
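
Alternatively the selector map can be pulled out directly (its printed format varies by kubectl version):

me@myhost:~ $ kubectl get service my-service-my -o jsonpath='{.spec.selector}'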

Now you can use that selector to construct a pod query for just that service:

me@myhost:~ $ kubectl get pods --selector app=my-service -o custom-columns=:metadata.name
my-service-my-a24eb5222-4qgx2
my-service-my-a24eb5222-8bgqf
my-service-my-a24eb5222-d4bh2
my-service-my-a24eb5222-hk2vj
my-service-my-a24eb5222-trmcc
my-service-my-a24eb5222-p34m8
my-service-my-a24eb5222-pn3qs
my-service-my-a24eb5222-rjtd6
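
The two steps can be combined into a single pipeline if jq is available; this sketch builds the key=value selector string straight from the service spec (and handles selectors with more than one label):

me@myhost:~ $ kubectl get pods -o custom-columns=:metadata.name \
      --selector "$(kubectl get service my-service-my -o json \
          | jq -r '.spec.selector | to_entries | map("\(.key)=\(.value)") | join(",")')"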