MySQL freezes for 2s every few hours: runc freeze timing out on a D-state process
A production MySQL container was freezing for two seconds at a time, every few hours. The business team's monitoring showed a momentary stall in connection handling — no alerts firing, no errors logged, but real and reproducible. In the runc log, one line stood out:
freeze container before SetUnitProperties failed: unable to freeze
The timestamps lined up perfectly with the user-visible stalls. This is the sequel to our previous post (/dev/urandom intermittent EPERM) — same freeze → update → thaw chain, no EPERM this time, but a 2-second pseudo-hang in its place.
Root cause: a process in D state (uninterruptible sleep) inside the cgroup keeps the freezer state machine stuck in FREEZING forever, so runc spins for ~2 seconds before giving up.
TL;DR
The failure chain
- kubelet's CPU manager (1.17) polls every 10 seconds and calls CRI's UpdateContainerResources, even when the cpuset hasn't changed.
- The CRI call goes through containerd to runc. In systemd cgroup driver mode, runc must freeze the container before updating it, because systemd's dbus interface can't deny a single device; the only way to update device rules is "deny all + replay the allow list," and during that window any device access from the workload would return EPERM.
- runc writes FROZEN to freezer.state and polls. Per kernel semantics: if any process in the cgroup is in D state, the cgroup stays in FREEZING forever.
- runc spins ~1,000 iterations (≈2s), logs the failure at Info level, and proceeds with the update anyway.
- During those 2 seconds the whole cgroup is frozen, so MySQL's healthy, non-D threads sit there doing nothing for 2 seconds. That's the user-visible "hang."
Fixes, in priority order
- Disable kubelet's CPU manager (if you don't need CPU pinning).
- Make CPU manager event-driven: only call UpdateContainerResources on container create or actual cpuset changes, not blindly every 10s.
- Long-term: switch cgroup driver from systemd to cgroupfs. Skips the "deny-all + replay-allow" dance entirely.
- Kernel-side: make FROZEN skip D-state processes. Cleanest fix, deepest change.
Background
The user team flagged occasional MySQL hangs. Low frequency but real:
- 2023-08-15: one occurrence
- 2024-03-26: this one
Same symptom both times. Slightly different root cause — last time the systemd dbus call hung; this time the freezer is stuck on a D-state process. This post focuses on the 2024-03 case but threads the older one in.
Evidence
MySQL threads in D state
Sampling task state from the node during a hang: yes, some MySQL threads are in D (uninterruptible sleep) at the exact moment of the hang.
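For anyone who wants to reproduce the sampling, here is a minimal Go sketch of one way to do it: read each thread's /proc/<pid>/task/<tid>/stat and keep the ones whose state letter is D. The function names parseStat and dStateThreads are mine, not from any tool we used, and the program samples its own PID unless you pass one (for example the mysqld PID).

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// parseStat extracts the thread name and state letter from one
// /proc/<pid>/task/<tid>/stat line. The name is wrapped in parentheses and
// may itself contain spaces or ')', so we split on the LAST ')'.
func parseStat(line string) (comm string, state byte, err error) {
	open := strings.IndexByte(line, '(')
	end := strings.LastIndexByte(line, ')')
	if open < 0 || end < open || end+2 >= len(line) {
		return "", 0, fmt.Errorf("malformed stat line: %q", line)
	}
	return line[open+1 : end], line[end+2], nil
}

// dStateThreads lists "tid (comm)" for every thread of pid that is
// currently in uninterruptible sleep (state D).
func dStateThreads(pid int) ([]string, error) {
	paths, err := filepath.Glob(fmt.Sprintf("/proc/%d/task/*/stat", pid))
	if err != nil {
		return nil, err
	}
	var out []string
	for _, p := range paths {
		data, err := os.ReadFile(p)
		if err != nil {
			continue // thread exited between glob and read
		}
		comm, state, err := parseStat(string(data))
		if err != nil || state != 'D' {
			continue
		}
		tid := filepath.Base(filepath.Dir(p))
		out = append(out, fmt.Sprintf("%s (%s)", tid, comm))
	}
	return out, nil
}

func main() {
	pid := os.Getpid() // default: sample ourselves
	if len(os.Args) > 1 {
		n, err := strconv.Atoi(os.Args[1])
		if err != nil {
			fmt.Fprintln(os.Stderr, "usage: dstate [pid]")
			os.Exit(2)
		}
		pid = n
	}
	threads, err := dStateThreads(pid)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Printf("pid %d: %d thread(s) in D state %v\n", pid, len(threads), threads)
}
```

D state is usually short-lived, so a single sample can easily miss it; running this in a tight loop around the expected hang time is more likely to catch something.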
runc log
... freeze container before SetUnitProperties failed: unable to freeze 2024-03-25T19:27:59
... freeze container before SetUnitProperties failed: unable to freeze 2024-03-25T21:13:07
... freeze container before SetUnitProperties failed: unable to freeze 2024-03-26T00:12:55
... freeze container before SetUnitProperties failed: unable to freeze 2024-03-26T04:10:27
... freeze container before SetUnitProperties failed: unable to freeze 2024-03-26T07:23:59
... freeze container before SetUnitProperties failed: unable to freeze 2024-03-26T07:53:27
... freeze container before SetUnitProperties failed: unable to freeze 2024-03-26T16:51:38
User hangs ↔ runc freeze-failed logs ↔ MySQL D-state events: all three line up perfectly.
Two assumptions to rule out first
Did freeze cause the D state?
Initial theory: maybe runc's FROZEN operation pushed MySQL threads into D state.
Tested it: manually freeze a cgroup with freezer.state=FROZEN and watch the task state of contained processes — freezing does not change task state. Reading kernel/cgroup/cgroup-v1.c and kernel/freezer.c confirms it: freezer uses a separate task->frozen flag, orthogonal to task->state.
So MySQL going into D is its own thing — typically IO wait or lock contention — not caused by runc.
Is the hang caused by D state itself, or by the freeze?
Symptoms are reproducible, so we can A/B it.
I hooked runc with a wrapper script that skips all update calls:
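#!/bin/bash
# Wrapper installed in place of /bin/runc; the real binary is kept as runc.original.
BIN_ROOT="/bin"
RUNC_NAME="runc"

invoke_runc() {
  local back="$BIN_ROOT/$RUNC_NAME.original"
  # Fall back to the unrenamed binary if the backup is missing.
  [[ -f "$back" ]] || back="$BIN_ROOT/$RUNC_NAME"
  # Skip all update calls: exit 0 without touching the cgroup.
  if echo "$@" | grep -qw update; then
    exit 0
  fi
  "$back" --debug "$@"
}

invoke_runc "$@"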
Deployed to affected nodes, backed up /bin/runc as runc.original, made all update calls no-ops.
The DBA team watched for a long stretch: no more hangs.
Conclusion: the hang is caused by runc's freeze step, not by the D state itself. The D-state process is just an amplifier that stretches freeze to 2 seconds.
Two ways runc's freeze can hurt you
doFreeze(FROZEN) roughly:
- Write FROZEN to freezer.state.
- Poll freezer.state every 10ms:
  - if FREEZING, keep waiting
  - until state is not FREEZING, or 2s elapsed
So runc gives freeze a 2-second window. After that it considers it failed, and proceeds with updateCgroup anyway. That last part is the key.
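The loop described above can be sketched in Go. This follows the simplified description (write, poll, give up) rather than runc's actual source. The names freezerFile and freeze and the injected wait function are hypothetical, and the stuck cgroup is simulated so the program runs without root or a real freezer hierarchy.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// freezerFile abstracts reads and writes of a cgroup v1 freezer.state file so
// the loop can be exercised without a real cgroup. Hypothetical type.
type freezerFile struct {
	write func(state string) error
	read  func() (string, error)
}

var errUnableToFreeze = errors.New("unable to freeze")

// freeze writes FROZEN, then polls until the state reaches FROZEN or the
// poll budget runs out. wait is called after every poll that sees FREEZING.
func freeze(f freezerFile, maxPolls int, wait func()) error {
	if err := f.write("FROZEN"); err != nil {
		return err
	}
	for i := 0; i < maxPolls; i++ {
		state, err := f.read()
		if err != nil {
			return err
		}
		switch state {
		case "FROZEN":
			return nil
		case "FREEZING":
			wait() // still converging, or stuck behind a D-state task
		default:
			return fmt.Errorf("unexpected state %q while freezing", state)
		}
	}
	return errUnableToFreeze
}

func main() {
	// Simulated cgroup that never leaves FREEZING, like one holding a D-state task.
	stuck := freezerFile{
		write: func(string) error { return nil },
		read:  func() (string, error) { return "FREEZING", nil },
	}
	start := time.Now()
	// 200 polls x 10ms gives the 2-second budget described above.
	err := freeze(stuck, 200, func() { time.Sleep(10 * time.Millisecond) })
	fmt.Printf("result: %v after %v\n", err, time.Since(start).Round(100*time.Millisecond))
}
```

The caller gets an error back after the budget is spent; what it does with that error is exactly where the two failure modes below diverge.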
Mode 1: FROZEN succeeded, but the update is slow (2023-08)
doFreeze(FROZEN) returns successfully, but updateCgroup hangs in a systemd dbus call. The container is correctly frozen and stays frozen for the duration of the slow dbus call. User-visible hang = however slow dbus is.
Cause: slow systemd dbus.
Mode 2: FREEZING times out (2024-03, this incident)
doFreeze(FROZEN) never reaches FROZEN — freezer.state is stuck at FREEZING.
Confirmed with a kernel maintainer: if any process in the cgroup is in D state, the FROZEN flip never completes; the cgroup stays in FREEZING. Other interruptible processes are effectively frozen, but the cgroup-level state never advances.
runc can't get FROZEN, spins 2s, gives up. Meanwhile during those 2 seconds:
- Most non-D threads are already actually frozen (just the cgroup-level state field says FREEZING)
- All non-D workload threads are stalled
- runc then gives up on the freeze and proceeds with the update, which adds more time. Total: the 2s+ hang the user sees
Cause: a D-state process in the cgroup keeps the freezer state machine stuck in FREEZING.
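A toy model makes the asymmetry easier to see. This is not kernel code, only a sketch of the behavior described above under the assumption that a task in D state cannot act on the freeze request until its wait completes; the task type and the thread names are invented for illustration.

```go
package main

import "fmt"

// task is a toy stand-in for a thread in the cgroup.
type task struct {
	name            string
	uninterruptible bool // true while the thread sits in D state
	frozen          bool
}

// requestFreeze models one freeze attempt: every task that can act on the
// request parks itself; a D-state task cannot until its wait completes.
func requestFreeze(tasks []task) {
	for i := range tasks {
		if !tasks[i].uninterruptible {
			tasks[i].frozen = true
		}
	}
}

// cgroupState reports FROZEN only when every task is frozen; a single
// holdout keeps the whole cgroup at FREEZING.
func cgroupState(tasks []task) string {
	for _, t := range tasks {
		if !t.frozen {
			return "FREEZING"
		}
	}
	return "FROZEN"
}

func main() {
	tasks := []task{
		{name: "conn-handler"},
		{name: "fsync-worker", uninterruptible: true},
	}
	requestFreeze(tasks)
	for _, t := range tasks {
		fmt.Printf("%-13s frozen=%v\n", t.name, t.frozen)
	}
	fmt.Println("freezer.state:", cgroupState(tasks))
}
```

The healthy thread is parked while the reported state never reaches FROZEN, which is the combination that produces a stall and a timeout at the same time.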
Why freeze in the first place?
Recap: systemd's dbus interface for updating device cgroups can't deny a single device. The only mechanism is deny all + replay allow list. So runc, in systemd cgroup driver mode, has to freeze the container around the update — otherwise the workload could touch a device during the deny → allow window and get EPERM.
The protection is reasonable. The runc comment is even clear about it:
// We have to freeze the container while systemd sets the cgroup settings.
// The reason for this is that systemd's application of DeviceAllow rules
// is done disruptively, resulting in spurious errors to common devices.
if err := m.Freeze(configs.Frozen); err != nil {
logrus.Infof("freeze container before SetUnitProperties failed: %v", err)
}
I'd previously hit this elsewhere in runc (opencontainers/runc#3804) — D-state processes causing freeze to fail and losing permission as a result.
But logging an Info and continuing with the update on freeze failure is itself a landmine:
- If update then proceeds with the deny → allow, you can hit EPERM (the urandom incident).
- If update follows a 2s FREEZING timeout, the workload was just frozen for 2 seconds (this MySQL incident).
Silently downgrading when a protection mechanism fails is dangerous.
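One alternative is to treat a failed freeze as a hard error. The sketch below is not runc's API: cgroupOps and updateStrict are hypothetical names, and the three steps are passed in as plain functions so the control flow can be run anywhere.

```go
package main

import (
	"errors"
	"fmt"
)

// cgroupOps bundles the three steps of the freeze → update → thaw chain as
// plain functions. Hypothetical type, not runc's API.
type cgroupOps struct {
	freeze func() error
	apply  func() error // e.g. the systemd property update
	thaw   func() error
}

var errFreeze = errors.New("unable to freeze")

// updateStrict refuses to run the update when the protective freeze fails,
// and returns the failure to the caller instead of logging and moving on.
func updateStrict(ops cgroupOps) error {
	if err := ops.freeze(); err != nil {
		// Never leave the cgroup half-frozen on the way out.
		if thawErr := ops.thaw(); thawErr != nil {
			return fmt.Errorf("freeze failed: %w; thaw also failed: %v", err, thawErr)
		}
		return fmt.Errorf("update aborted, freeze failed: %w", err)
	}
	applyErr := ops.apply()
	thawErr := ops.thaw() // always thaw, even if the update failed
	if applyErr != nil {
		return applyErr
	}
	return thawErr
}

func main() {
	applied := false
	ops := cgroupOps{
		freeze: func() error { return errFreeze },
		apply:  func() error { applied = true; return nil },
		thaw:   func() error { return nil },
	}
	err := updateStrict(ops)
	fmt.Println("error:", err)
	fmt.Println("update applied:", applied)
}
```

The trade-off is that the update is skipped for this round, so the caller has to decide whether and when to retry. Aborting also does not remove the stall caused by the freeze attempt itself; it only prevents the unprotected deny → allow window.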
Where does the update come from? Kubelet CPU manager.
Why is update even being triggered periodically?
In Kubernetes 1.17 and earlier, cpu-manager calls UpdateContainerResources over CRI every 10 seconds, regardless of whether the cpuset actually changed. Each call triggers:
kubelet cpu-manager (10s tick)
│
▼
CRI: UpdateContainerResources
│
▼
containerd → runc update
│
├─► freeze (may time out at 2s)
├─► systemd dbus updates cgroup
└─► thaw
In normal conditions — no D-state processes, systemd healthy — the whole chain is invisible. But the moment the workload enters IO or lock-wait and gets a D-state process, the freeze step turns into a 2-second pause button pressed on the workload every 10 seconds.
Most ironic part: many workloads have no CPU pinning requirement at all. They never wanted CPU manager. But the "useless update every 10s" plants the landmine in their production anyway.
Fixes
Disable cpu-manager (fastest)
If your workload doesn't need CPU pinning, just disable kubelet's CPU manager. The trade-off: you lose static CPU allocation for Guaranteed pods. Pure CFS-scheduled workloads aren't affected.
Make cpu-manager event-driven (mid-term)
Only call UpdateContainerResources on container creation or actual cpuset change. This is the direction upstream has taken: newer cpu-managers are event-driven. Older clusters can backport the change.
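A minimal sketch of the idea in Go, assuming nothing about kubelet's real implementation: remember the cpuset last applied per container and skip the call when the desired value is unchanged. cpusetSyncer and its methods are hypothetical, and the update callback stands in for the CRI UpdateContainerResources call.

```go
package main

import "fmt"

// cpusetSyncer remembers the cpuset last applied to each container and only
// calls update when the desired value differs. Hypothetical type.
type cpusetSyncer struct {
	applied map[string]string
	update  func(containerID, cpuset string) error
}

func newCpusetSyncer(update func(containerID, cpuset string) error) *cpusetSyncer {
	return &cpusetSyncer{applied: map[string]string{}, update: update}
}

// sync reports whether an update was actually sent.
func (s *cpusetSyncer) sync(containerID, cpuset string) (bool, error) {
	if last, ok := s.applied[containerID]; ok && last == cpuset {
		return false, nil // nothing changed: no runc update, no freeze
	}
	if err := s.update(containerID, cpuset); err != nil {
		return false, err // not recorded, so the next tick retries
	}
	s.applied[containerID] = cpuset
	return true, nil
}

// forget drops state for a removed container.
func (s *cpusetSyncer) forget(containerID string) {
	delete(s.applied, containerID)
}

func main() {
	calls := 0
	s := newCpusetSyncer(func(id, cpuset string) error {
		calls++
		fmt.Printf("update %s -> cpuset %s\n", id, cpuset)
		return nil
	})
	// Six reconcile ticks; the cpuset changes once.
	for _, want := range []string{"0-3", "0-3", "0-3", "0-1", "0-1", "0-1"} {
		if _, err := s.sync("mysql", want); err != nil {
			fmt.Println("sync failed:", err)
		}
	}
	fmt.Println("updates sent:", calls) // prints "updates sent: 2"
}
```

The periodic tick can stay as a safety net. What matters is that a tick with nothing to change never reaches runc, so it never triggers a freeze.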
Switch cgroup driver to cgroupfs (long-term)
cgroupfs driver doesn't go through systemd dbus — no "deny all + replay allow" dance — so runc doesn't need to freeze for protection. All the landmines disappear. Cost: cgroup driver switch requires node drains and kubelet/containerd restarts. Long rollout window on large clusters.
Kernel: skip D-state processes in FROZEN (ideal)
The cleanest fix: let the freezer skip D-state processes when transitioning to FROZEN, instead of getting stuck in FREEZING. Needs kernel-level semantic changes to the freezer state machine.
Reflections
1. The same protection mechanism can hurt the workload in two different ways. Last time (urandom), freeze failed and protection broke, workload hit EPERM. This time (MySQL), freeze failed and runc spun for 2s, workload froze for 2s. The same code path ("log and proceed when freeze fails") is the landmine in both scenarios — one gives you EPERM, the other gives you a 2s hang. The right fix is: freeze failure must abort the update and surface an error.
2. Workloads entering D state isn't a bug — it's normal. MySQL doing fsync, waiting on I/O, or holding a kernel lock are all normal reasons to be in D. But the moment that happens, the freezer's behavior amplifies it. An otherwise-innocuous "scheduler periodically syncs cpuset" turns into a 2-second user-visible hang.
3. "Update every 10s with no change" is over-synchronization. Kubelet 1.17 cpu-manager's "reconcile everything every N seconds" model is supposed to be a no-op when nothing changes. But the lower-level runc + systemd implementation makes every "no-op" actually mutate the cgroup. When designing periodic reconciles: assume the real cost of each tick is much higher than you think.
4. Log level and impact severity are decoupled. runc logs unable to freeze at Info. In reality, it means "the protection mechanism failed and the upcoming update will cause user-visible side effects." That should be at least Warn, ideally tied to retry logic. Production is full of "low log level, high impact" landmines like this.
5. Cross-layer problems have no owner. This issue touches kubelet (10s reconcile), CRI (passthrough Resources), containerd (request wrapping), runc (freeze + update), systemd (dbus + cgroup driver), the kernel (freezer state machine), and the workload (D-state). No single layer is "wrong" in isolation — but the composition gives users a 2-second hang. The hardest bugs are the ones where you don't know which issue tracker to file in.
Related links
- runc: D-state cgroup freeze workaround (#3804)
- kubelet cpu-manager evolution — moved to event-driven in later versions
- systemd cgroup.c: DeviceAllow deny+allow implementation
Translated from a 2024-03 post-mortem of a real MySQL incident in production. Internal hostnames, node names, and workload identifiers redacted.

Written by
Zoe
AI Infra Engineer · LLM Serving · GPU/RDMA · indie hacker, obsessed with shipping tools