
The mystery of kubelet eating CPU and IOPS

Thomas Dullien
Sep 1, 2021 @halvarflake

System-wide continuous profiling has a habit of turning up performance murder mysteries: The data tells you that something is off in a rather unexpected manner, and an investigation then reveals surprising insights about the deployed systems.

In this blog post, we are going to investigate a mystery involving kubelet, the Kubernetes process responsible for per-node housekeeping.

Our investigation starts with Prodfiler deployed on a Kubernetes cluster of 100+ nodes running various services. To everybody’s great surprise, we’d occasionally see individual machines exhibiting “jumps” in kubelet CPU utilization: kubelet CPU consumption would rise to 10x the normal level and persist over hours or days, consuming almost an entire core.

Sudden appearance of 12%+ CPU consumption by `kubelet` on a multicore distributed DB server

The first thing we do in such a scenario is to pull a flamegraph for the corresponding service and time period:

A flamegraph from the offending `kubelet`

Surprisingly, the increase in CPU consumption from kubelet was almost entirely caused by `path/filepath.Walk()` code. Pulling a differential flamegraph from before and after the jump in CPU utilization confirmed that the increase came almost entirely from the directory walking (and the corresponding GC pressure).

Differential flamegraph comparing `kubelet` pre- and post-jump: Grey code is unchanged in CPU consumption

In the above flamegraph, the grey areas show the parts of the code whose CPU consumption did not increase; everything in red shows increased consumption over an equivalent timeframe.

The biggest increase is in `cadvisor/container/common.(*realFsHandler).trackUsage()`: over a one-hour period, we used to get ~16 samples involving this stack frame; after the jump in CPU utilization, we received almost 15k samples. Clearly something is going wrong.

Digging into the relevant source code for cadvisor, we land at this snippet:
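The snippet lives in cadvisor's filesystem handler (the `trackUsage` frame we saw above); the pattern it implements boils down to something like the following simplified Go sketch, in which the function names and the path argument are illustrative rather than cadvisor's actual identifiers:

```go
// Simplified sketch of the pattern, not the verbatim cadvisor source:
// periodically walk a container's r/w layer and sum up its on-disk usage.
package main

import (
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
	"time"
)

// diskUsage recursively walks rootfs and adds up file sizes, touching every
// directory entry on the way; this is the path/filepath.Walk() work that
// dominates the flamegraph above.
func diskUsage(rootfs string) (int64, error) {
	var total int64
	err := filepath.Walk(rootfs, func(path string, info fs.FileInfo, err error) error {
		if err != nil {
			return nil // ignore entries that disappeared mid-walk
		}
		if !info.IsDir() {
			total += info.Size()
		}
		return nil
	})
	return total, err
}

// trackUsage re-runs the walk on a fixed period, the way cadvisor's fs
// handler periodically refreshes per-container usage statistics.
func trackUsage(rootfs string, period time.Duration) {
	for range time.Tick(period) {
		usage, err := diskUsage(rootfs)
		if err != nil {
			fmt.Fprintln(os.Stderr, "usage walk failed:", err)
			continue
		}
		fmt.Printf("%s: %d bytes in r/w layer\n", rootfs, usage)
	}
}

func main() {
	// Pass the directory to track, e.g. a container's writable-layer path.
	trackUsage(os.Args[1], time.Minute)
}
```

Every minute, every file and directory in the writable layer gets stat()ed again, which is harmless for small layers but adds up quickly for trees with tens of thousands of entries.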

It turns out that cadvisor, in order to track the disk usage of local containers, performs a recursive directory walk of the r/w layer of all containers every minute.

Further investigation revealed that some of the deployed containers would, under certain circumstances, create rather sizeable directory trees on the r/w layer: the individual files were small, but the overall filesystem tree was large (many directories containing many small files).

In pessimal cases, this can lead to a triple-whammy of resource consumption:

  1. kubelet/cadvisor can end up eating significant CPU walking the directory tree repeatedly.
  2. Depending on other factors, the periodic filesystem walk can start consuming resources such as EBS credits, degrading the IOPS of the underlying volume.
  3. Performing the directory walk generates a large number of short-lived allocations, which in turn create more work for the garbage collector (in our example, the number of GC-related samples went from ~270 per hour to ~1300 per hour); a small measurement sketch follows below.
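To put a number on the third point, a small hypothetical measurement helper (not part of kubelet, cadvisor, or the original investigation) can count the heap allocations that a single `filepath.Walk()` pass produces; with one walk per minute per container, those short-lived objects translate directly into extra GC work:

```go
// allocs.go: count heap allocations caused by one filepath.Walk() pass.
// Hypothetical measurement helper, not part of kubelet or cadvisor.
package main

import (
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
	"runtime"
)

// walkOnce visits every entry under root, the same kind of work the periodic
// usage tracking performs.
func walkOnce(root string) int {
	files := 0
	filepath.Walk(root, func(path string, info fs.FileInfo, err error) error {
		if err == nil && !info.IsDir() {
			files++
		}
		return nil
	})
	return files
}

func main() {
	root := os.Args[1] // directory tree to walk, e.g. a container r/w layer

	var before, after runtime.MemStats
	runtime.ReadMemStats(&before)
	files := walkOnce(root)
	runtime.ReadMemStats(&after)

	// Every os.Lstat() during the walk returns a fresh FileInfo, so each pass
	// over a large tree generates a pile of short-lived garbage for the GC.
	fmt.Printf("visited %d files; ~%d heap allocations in a single pass\n",
		files, after.Mallocs-before.Mallocs)
}
```

On a large tree, a single pass allocates roughly one FileInfo per entry plus the path strings, which is exactly the kind of garbage that shows up as additional GC samples in the profile.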

The issue is not new: in the past, cadvisor had the habit of running `du` on every container at regular intervals, and this issue has a pretty thorough discussion of the fix of only iterating over the r/w layer.

Now, it is no secret that creating large numbers of files and directories in the r/w layer of containers is bad practice (and usually easily avoided). That said, not every developer is aware of the detrimental effects this can have outside of their own container.

Once the issue was diagnosed, it was fixed relatively easily by reducing the size and depth of the directory structure the container created on startup, and kubelet CPU (and IOPS) consumption reverted to the expected norm.

If you want to reproduce the issue on your own cluster, you can use the following source code and Dockerfile:
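A minimal reproduction along these lines (a sketch: the directory and file counts, paths, and image tags below are illustrative) just needs to create a wide tree of small files on the container's writable layer at startup and then idle, so the periodic usage walk has something to chew on:

```go
// repro.go: create a large tree of small files on the container's r/w layer,
// then sleep forever so the periodic kubelet/cadvisor walk keeps revisiting it.
// Directory and file counts are illustrative; tune them to taste.
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"time"
)

const (
	topDirs     = 100 // first-level directories
	subDirs     = 100 // second-level directories per top-level directory
	filesPerDir = 10  // small files in each leaf directory
)

func main() {
	root := "/repro-tree" // lands on the container's writable layer
	for i := 0; i < topDirs; i++ {
		for j := 0; j < subDirs; j++ {
			dir := filepath.Join(root, fmt.Sprintf("d%03d", i), fmt.Sprintf("d%03d", j))
			if err := os.MkdirAll(dir, 0o755); err != nil {
				panic(err)
			}
			for k := 0; k < filesPerDir; k++ {
				name := filepath.Join(dir, fmt.Sprintf("f%02d", k))
				if err := os.WriteFile(name, []byte("x"), 0o644); err != nil {
					panic(err)
				}
			}
		}
	}
	fmt.Println("tree created, idling so the periodic usage walk has work to do")
	for {
		time.Sleep(time.Hour)
	}
}
```

```dockerfile
# Illustrative Dockerfile: build the generator and run it as the container
# entrypoint so the tree is created on the writable layer at startup.
FROM golang:1.21 AS build
WORKDIR /src
COPY repro.go .
RUN CGO_ENABLED=0 go build -o /repro repro.go

FROM alpine:3.19
COPY --from=build /repro /repro
ENTRYPOINT ["/repro"]
```

Deploying a pod built from this image onto a test node should make the increase in kubelet/cadvisor CPU consumption visible after a few collection cycles.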

If this blog post made you curious about continuously profiling entire Kubernetes clusters (or other large-scale deployments), visit https://run.prodfiler.com/signup and get started :-)