/dev/urandom intermittently returns EPERM: device-permission jitter under runc + systemd cgroup driver


An ad-retrieval service in production was core-dumping. The error message:

libc++abi: terminating with uncaught exception of type std::__1::system_error:
  random_device failed to open /dev/urandom: Operation not permitted

The bizarre part: same image, same K8s cluster, yet the service is stable on docker nodes and crashes reliably on containerd nodes. How does a device permission that should always be there briefly disappear?

This post is the post-mortem. The conclusion is specific: under runc with the systemd cgroup driver, every container resource update first denies all devices, then replays the entire allow list, leaving the workload with a sub-second "no permissions" window.

TL;DR

Root cause

Under runc + systemd cgroup driver, device cgroup updates go like this:

  1. Deny all first: write "a" to devices.deny, which wipes every device permission.
  2. Replay the whole allow list: for each device rule (c 1:*, b 8:*, etc.), write it back to devices.allow one by one.

The process is not atomic. After deny, before allow completes, any device access inside the container returns EPERM. The longer the allow list, the wider the window.
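
To make the window concrete, here is a toy model of the sequence above. It is a sketch, not kernel or runc code: DeviceCgroup, update_and_probe, and the rule tuples are names invented for illustration, and the model only does exact-match checks (no wildcards, no access bits).

```python
"""Toy model of a v1 device cgroup whose allow list is rebuilt non-atomically."""


class DeviceCgroup:
    def __init__(self, rules):
        # Each rule is a (type, major, minor) tuple, e.g. ("c", 1, 9).
        self.allowed = set(rules)

    def deny_all(self):
        # Models writing "a" to devices.deny.
        self.allowed.clear()

    def allow(self, rule):
        # Models one write to devices.allow.
        self.allowed.add(rule)

    def open_device(self, rule):
        # Exact-match check only; the real check also handles wildcards.
        if rule not in self.allowed:
            raise PermissionError(1, "Operation not permitted")
        return True


def update_and_probe(cgroup, rules, probe):
    """Replay `rules` after a deny-all, probing `probe` before every write.

    Returns the number of probes that failed with EPERM.
    """
    failures = 0
    cgroup.deny_all()
    for rule in rules:
        try:
            cgroup.open_device(probe)
        except PermissionError:
            failures += 1
        cgroup.allow(rule)
    return failures


if __name__ == "__main__":
    urandom = ("c", 1, 9)
    short_list = [urandom, ("c", 1, 3)]
    long_list = [("c", 200 + i, 0) for i in range(50)] + [urandom]
    for name, rules in (("short", short_list), ("long", long_list)):
        cg = DeviceCgroup(rules)
        print(name, "failed probes:", update_and_probe(cg, rules, urandom))
```

With urandom first in a two-rule list, one probe fails. With 50 made-up rules replayed ahead of it, 51 probes fail. Same update logic, wider window.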

Three necessary conditions

  1. Runtime is containerd. containerd's UpdateContainerResources passes the full OCI Linux Resources (including devices) to runc; under the systemd driver, runc goes through the deny → allow path. Docker, for the same CRI call, forwards only the CPU, memory, and cpuset fields: no device update, no systemd rewrite.

  2. Freeze fails on the freezer cgroup. runc tries to freeze the container before update so device-rule churn isn't user-visible. But after retrying ~1000 times (≈2s) and failing, runc proceeds with the update without freezing — exposing the jitter window to the workload.

  3. Workload accesses devices at millisecond frequency. The window is sub-second; normal workloads touching /dev/urandom once a minute almost never trip it. But this particular ad-retrieval service called std::random_device per request, which meant opening /dev/urandom roughly every millisecond.

All three simultaneously → "/dev/urandom intermittently returns EPERM, then abort."
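
A back-of-the-envelope check on the third condition. The 0.1 s window below is an assumption (we only know it is sub-second), and the model assumes evenly spaced accesses with a random phase relative to the window. The function names are ours.

```python
def expected_hits(window_s, access_interval_s):
    """Average number of device accesses that land inside one window."""
    return window_s / access_interval_s


def hit_probability(window_s, access_interval_s):
    """Chance that one window contains at least one access.

    Assumes evenly spaced accesses with a random phase relative to the
    window, so the probability is window/interval, capped at 1.
    """
    return min(1.0, window_s / access_interval_s)


if __name__ == "__main__":
    window = 0.1  # assumed window length in seconds; the real value was not measured here
    for label, interval in (("per request, ~1 ms", 0.001), ("once a minute", 60.0)):
        print(label, expected_hits(window, interval), hit_probability(window, interval))
```

Under these assumptions the per-request reader lands roughly a hundred opens inside every window, while a once-a-minute reader is hit in about 1 of every 600 updates.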

Fix paths

  1. Workload side: reduce frequency of random-device access (intrusive).
  2. runc side: when freeze fails, don't proceed with the update. Merged upstream.
  3. Workload side (indirect): avoid huge piles of D-state processes so freezer actually works.

Walkthrough below.

Side-by-side: the contrasting nodes

Host    Runtime     Behavior
node-A  containerd  Service crashes reliably
node-B  docker      Service stable

Enter the container, check the device cgroup. The devices.list looks different on each side.

Affected (containerd):

b 8:* m       b 9:* m       ...
c 1:* m       c 4:* m       c 5:* m
c 7:* m       c 10:* m      ...
c 1:3 rwm     c 1:5 rwm     c 1:7 rwm
c 1:8 rwm     c 1:9 rwm     ...

Working (docker):

c 136:* rwm   c 5:2 rwm   c 5:1 rwm   c 5:0 rwm
c 1:9 rwm     c 1:8 rwm   c 1:7 rwm   c 1:5 rwm
c 1:3 rwm
b *:* m       c *:* m     c 10:200 rwm

/dev/urandom (c 1:9) is present and rwm on both. In steady state, both are correct. The problem must be in a transient state.
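
If you want to run the same steady-state check mechanically, a small parser is enough. This is a sketch: parse_devices_list and is_allowed are our own helper names, and the matching rule (one entry must cover the device and grant every requested access bit) is our reading of the v1 devices controller.

```python
def parse_devices_list(text):
    """Parse devices.list content into (type, major, minor, access) tuples.

    Accepts one rule per line (the kernel's format) or several rules per
    line (the column layout used in the listings above).
    """
    tokens = text.split()
    rules = []
    for i in range(0, len(tokens) - 2, 3):
        dev_type, numbers, access = tokens[i:i + 3]
        major, minor = numbers.split(":")
        rules.append((dev_type, major, minor, access))
    return rules


def is_allowed(rules, dev_type, major, minor, access):
    """True if one rule covers the device and grants every requested access bit."""
    for r_type, r_major, r_minor, r_access in rules:
        if r_type not in (dev_type, "a"):
            continue
        if r_major not in ("*", str(major)) or r_minor not in ("*", str(minor)):
            continue
        if set(access) <= set(r_access):
            return True
    return False


if __name__ == "__main__":
    listing = "c 1:* m   c 1:9 rwm   b 8:* m"
    rules = parse_devices_list(listing)
    print("urandom rw:", is_allowed(rules, "c", 1, 9, "rw"))
```

Note that "c 1:* m" alone would not be enough for urandom: it grants mknod only, so the explicit "c 1:9 rwm" entry is what makes reads work.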

Confirming the window exists

Reproduction setup

Pick a containerd node, deploy a Deployment pinned to it, use the broken image:

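apiVersion: apps/v1
kind: Deployment
metadata:
  name: random-read-device-test
spec:
  replicas: 1
  selector:                      # apps/v1 requires a selector matching the pod labels
    matchLabels:
      app: random-read-device-test
  template:
    metadata:
      labels:
        app: random-read-device-test
    spec:
      nodeName: debug-node
      containers:
      - image: internal-registry/ad/target_search_server:1.0.x
        name: random-read-device-test
        command: [sleep]
        args: [infinity]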

Test program: a simple multi-threaded C++ program that hammers /dev/urandom via std::random_device.
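
The original reproducer is not included in this post. For readers who want something equivalent, here is a rough Python stand-in (an assumption, not the program we ran): it re-opens the device on every iteration, because the permission check is what fails at open time. The hammer function and its parameters are invented for this sketch.

```python
import threading
import time


def hammer(path="/dev/urandom", threads=1, duration_s=1.0):
    """Open and read `path` in a tight loop from several threads.

    Returns (successful_opens, first_error); first_error is None if every
    open succeeded before the deadline.
    """
    deadline = time.monotonic() + duration_s
    lock = threading.Lock()
    result = {"opens": 0, "error": None}

    def worker():
        while time.monotonic() < deadline:
            with lock:
                if result["error"] is not None:
                    return
            try:
                # Re-open every time: a cached descriptor would hide the window.
                with open(path, "rb", buffering=0) as dev:
                    dev.read(8)
            except OSError as exc:
                with lock:
                    if result["error"] is None:
                        result["error"] = exc
                return
            with lock:
                result["opens"] += 1

    pool = [threading.Thread(target=worker) for _ in range(threads)]
    for t in pool:
        t.start()
    for t in pool:
        t.join()
    return result["opens"], result["error"]


if __name__ == "__main__":
    opens, error = hammer(threads=4, duration_s=10.0)
    print(f"opens={opens} first_error={error!r}")
```

On an affected node the expectation would be a PermissionError as first_error; on a healthy one, None.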

Enter the container's cgroup + namespace:

cid=$(crictl ps | grep random-read-device-test | awk '{print $1}')
pid=$(crictl inspect $cid | jq '.info.pid')
cg=$(crictl inspect $cid | jq -r '.info.runtimeSpec.linux.cgroupsPath')
cgpath=/$(echo $cg | awk -F '-' '{print $1 ".slice/" $1 "-" $2 ".slice"}')/$(echo $cg | awk -F ':' '{print $1 "/" $2 "-" $3 ".scope"}')

cgexec -g devices:$cgpath -g freezer:$cgpath nsenter -t $pid -m bash

Run the test:

[root@debug-node ~]# time ./read_urandom
threads=1
spawning read-urandom thread: 0......
libc++abi: terminating with uncaught exception of type std::__1::system_error:
  random_device failed to open /dev/urandom: Operation not permitted
Aborted

real    0m8.431s

Reproduces in under 10 seconds, reliably.

Experiment 1: stop kubelet — does it go away?

sudo systemctl stop kubelet

Re-run:

[root@debug-node ~]# time ./read_urandom
^C

real    1m24.155s

Ran for over a minute with no crash. So the trigger is some periodic action that kubelet drives, and the chain it travels is kubelet → CRI → containerd → runc.

Experiment 2: trace cgroup writes

Watch cgroup_file_write with bpftrace:

kprobe:cgroup_file_write
{
    // arg0 is the kernfs_open_file being written, arg1 the buffer being written.
    // The casts need kernel struct definitions (BTF or kernel headers).
    $of = (struct kernfs_open_file *)arg0;
    $dentry = $of->file->f_path.dentry;
    printf("write cgroup by %s<%d>: %s/%s: %s\n",
        comm, pid,
        str($dentry->d_parent->d_name.name), str($dentry->d_name.name),
        str(arg1));
}

Output: every ~10 seconds, a write of devices.deny=a followed by a flurry of writes to devices.allow. The gap between those writes is the window the workload trips on.

Why docker doesn't fail: payload differences to runc

This is the critical difference.

containerd's update payload

{
  "devices": [{"allow": true, "access": "rwm"}],
  "memory": {"limit": 4294967296},
  "cpu": {"shares": 2048, "quota": 200000, "period": 100000, "cpus": "0-55"}
}

The payload contains a devices field. Even though it is only a single catch-all rule, runc enters the device update path.

docker's update payload

{
  "memory": {"limit": 0, "reservation": 0, "kernel": 0},
  "cpu": {"shares": 0, "quota": 0, "period": 0, "cpus": "0-63"},
  "blockIO": {"weight": 0}
}

Only cpu, memory, blockIO — no devices at all. runc has no reason to touch device rules.
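
The difference is easy to confirm mechanically. The sketch below compares the top-level keys of the two payloads shown above; only_in is a helper name of ours.

```python
import json

CONTAINERD_UPDATE = """
{
  "devices": [{"allow": true, "access": "rwm"}],
  "memory": {"limit": 4294967296},
  "cpu": {"shares": 2048, "quota": 200000, "period": 100000, "cpus": "0-55"}
}
"""

DOCKER_UPDATE = """
{
  "memory": {"limit": 0, "reservation": 0, "kernel": 0},
  "cpu": {"shares": 0, "quota": 0, "period": 0, "cpus": "0-63"},
  "blockIO": {"weight": 0}
}
"""


def top_level_keys(payload):
    """Set of top-level field names in a JSON resources payload."""
    return set(json.loads(payload))


def only_in(first, second):
    """Top-level keys present in `first` but missing from `second`, sorted."""
    return sorted(top_level_keys(first) - top_level_keys(second))


if __name__ == "__main__":
    print("only containerd sends:", only_in(CONTAINERD_UPDATE, DOCKER_UPDATE))
    print("only docker sends:", only_in(DOCKER_UPDATE, CONTAINERD_UPDATE))
```

The only field containerd sends that docker does not is devices; the only one docker adds is blockIO.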

Source diff:

containerd CRI (pkg/cri/server/container_update_resources.go):

func (c *criService) updateContainerResources(ctx context.Context, ...) (retErr error) {
    // Take current OCI spec, update wholesale, write back wholesale
    oldSpec, _ := cntr.Container.Spec(ctx)
    newSpec, _ := updateOCIResource(ctx, oldSpec, r, c.config)
    // ...
    if err := task.Update(ctx, containerd.WithResources(getResources(newSpec))); err != nil {
        return fmt.Errorf("failed to update resources: %w", err)
    }
}

containerd passes the whole OCI Linux Resources — devices included.

docker shim (pkg/kubelet/dockershim/docker_container.go):

func (ds *dockerService) UpdateContainerResources(...) (*runtimeapi.UpdateContainerResourcesResponse, error) {
    resources := r.Linux
    updateConfig := dockercontainer.UpdateConfig{
        Resources: dockercontainer.Resources{
            CPUPeriod:  resources.CpuPeriod,
            CPUQuota:   resources.CpuQuota,
            CPUShares:  resources.CpuShares,
            Memory:     resources.MemoryLimitInBytes,
            CpusetCpus: resources.CpusetCpus,
            CpusetMems: resources.CpusetMems,
        },
    }
    // ...
}

Docker cherry-picks only the CPU, memory, and cpuset fields from the CRI request and never touches devices.

That's why the same Kubernetes setup behaves completely differently across runtimes.

Why runc "refreshes" the whole device list

systemd's behavior

When systemd receives DeviceAllow via dbus (src/core/cgroup.c), it does:

if (c->device_allow || policy != CGROUP_DEVICE_POLICY_AUTO)
    r = cg_set_attribute("devices", path, "devices.deny", "a");  // deny all first
else
    r = cg_set_attribute("devices", path, "devices.allow", "a");

// then iterate device_allow, replay each one
LIST_FOREACH(device_allow, a, c->device_allow) {
    if (path_startswith(a->path, "/dev/"))
        r = bpf_devices_allow_list_device(prog, path, a->path, acc);
    else if ((val = startswith(a->path, "block-")))
        r = bpf_devices_allow_list_major(prog, path, val, 'b', acc);
    // ...
}

Wildcards trigger a sweep through /proc/devices, writing every matching major number:

int bpf_devices_allow_list_major(...) {
    if (streq(name, "*"))
        return allow_list_device_pattern(prog, path, type, NULL, NULL, acc);

    // For each line of /proc/devices, fnmatch
    f = fopen("/proc/devices", "re");
    for (;;) {
        // ... line by line; matching ones go through allow_list_device_pattern
    }
}

Every major matched by a wildcard becomes one write. That's why the containerd-side devices.list has so many c 1:* m, c 4:* m, c 5:* m, ...
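
A rough Python rendition of that sweep, to show how one glob fans out into many writes. The /proc/devices content is a made-up sample, and rules_for_pattern is our own name, not a systemd function.

```python
from fnmatch import fnmatch

# Made-up sample in the format of /proc/devices.
SAMPLE_PROC_DEVICES = """\
Character devices:
  1 mem
  4 tty
  5 /dev/tty
 10 misc
136 pts

Block devices:
  8 sd
  9 md
"""


def rules_for_pattern(proc_devices, dev_type, pattern, access):
    """Expand a device-name glob into one rule per matching major number.

    dev_type is "c" or "b"; returns strings in devices.allow syntax.
    """
    section = None
    rules = []
    for line in proc_devices.splitlines():
        line = line.strip()
        if line == "Character devices:":
            section = "c"
        elif line == "Block devices:":
            section = "b"
        elif line and section == dev_type:
            major, name = line.split(None, 1)
            if fnmatch(name, pattern):
                rules.append(f"{dev_type} {major}:* {access}")
    return rules


if __name__ == "__main__":
    for rule in rules_for_pattern(SAMPLE_PROC_DEVICES, "c", "*", "m"):
        print(rule)
```

On this sample, a single character-device glob already turns into five separate rules. A real host has far more majors, and each one lengthens the replay.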

runc's belt-and-suspenders

runc builds dbus properties before calling systemd (libcontainer/cgroups/systemd/v1.go):

func genV1ResourcesProperties(r *configs.Resources, cm *dbusConnManager) ([]systemdDbus.Property, error) {
    deviceProperties, _ := generateDeviceProperties(r)
    properties = append(properties, deviceProperties...)
    // cpu/memory/cpuset etc.
}

Helpers such as addCpuset skip their property when systemd is too old:

func addCpuset(...) error {
    sdVer := systemdVersion(cm)
    if sdVer < 244 {
        logrus.Debugf("systemd v%d is too old ...", sdVer)
        return nil
    }
}

Our systemd is v219 — CPU/Memory/Cpuset are all skipped on dbus because of version checks. Devices, though, still get passed. Then runc writes them again via cgroupfs as a fallback:

func (m *legacyManager) Set(r *configs.Resources) error {
    // First call systemd dbus
    setErr := setUnitProperties(m.dbus, unitName, properties...)

    // Then write each legacy subsystem via cgroupfs
    for _, sys := range legacySubsystems {
        path, _ := m.paths[sys.Name()]
        if err := sys.Set(path, r); err != nil {
            return err
        }
    }
}

So systemd writes once, runc writes again via cgroupfs. The device update is performed more than once per update call.

The protection that should exist: freeze

runc's comment is unambiguous:

// We have to freeze the container while systemd sets the cgroup settings.
// The reason for this is that systemd's application of DeviceAllow rules
// is done disruptively, resulting in spurrious errors to common devices
// (unlike our fs driver, they will happily write deny-all rules to running
// containers). So we freeze the container to avoid them hitting the cgroup
// error. But if the freezer cgroup isn't supported, we just warn about it.
if err := m.Freeze(configs.Frozen); err != nil {
    logrus.Infof("freeze container before SetUnitProperties failed: %v", err)
}

They already thought of this. The designed fix: freeze the container before device-rule update, thaw after. Workload doesn't see the transient state.

But in production we see:

... freeze container before SetUnitProperties failed: unable to freeze

Freeze fails. Look at the freezer cgroup code:

func (s *FreezerGroup) Set(path string, r *configs.Resources) (Err error) {
    for i := 0; i < 1000; i++ {
        // ... check state
        switch state {
            case "FREEZING":
                continue
        }
    }
    return errors.New("unable to freeze")
}

After 1000 retries (≈2s) without reaching FROZEN, the function returns an error, but the caller in runc only logs it at Info level and continues with the full update.
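
The shape of that loop, reduced to a sketch. This is not runc's code: freeze, read_state, and FreezeError are illustrative names, and the 2 ms delay is an assumption picked so that 1000 polls add up to roughly 2 s.

```python
import time


class FreezeError(Exception):
    """Raised when the cgroup never reports FROZEN within the retry budget."""


def freeze(read_state, retries=1000, delay_s=0.002):
    """Poll `read_state()` until it returns "FROZEN", at most `retries` times.

    Returns the number of polls used; raises FreezeError if the state never
    settles, which is what a task stuck in D state would cause.
    """
    for attempt in range(1, retries + 1):
        if read_state() == "FROZEN":
            return attempt
        time.sleep(delay_s)
    raise FreezeError("unable to freeze")


if __name__ == "__main__":
    states = iter(["FREEZING", "FREEZING", "FROZEN"])
    print("frozen after polls:", freeze(lambda: next(states)))
    try:
        freeze(lambda: "FREEZING", retries=5, delay_s=0)
    except FreezeError as exc:
        print("gave up:", exc)
```

The loop itself is fine. What matters is what the caller does with the error.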

Why doesn't freeze succeed? Most likely: too many threads or processes in D state (uninterruptible sleep). The freezer waits for every task to reach FROZEN; one stuck in D blocks it for as long as it stays there. (We dug into this exact mechanism in the sister incident, "MySQL freezes for 2s every few hours".)

The full chain:

kubelet syncs resources periodically
    │
    ▼
CRI: UpdateContainerResources (containerd, with devices field)
    │
    ▼
containerd → runc update
    │
    ├─► runc Freeze()
    │     └─► 1000 retries fail → "unable to freeze"
    │     └─► only an Info log; proceeds anyway (!!)
    │
    ▼
runc → systemd dbus DeviceAllow
    │
    └─► systemd: devices.deny=a first, then replay allow list
          │
          └─► sub-second window: no device permissions at all
                │
                └─► workload reads /dev/urandom every 1ms → falls in
                      └─► open returns EPERM → std::random_device throws → abort

Verification experiments

Experiment 1: switch cgroup driver to cgroupfs

Set containerd SystemdCgroup=false and kubelet cgroup-driver=cgroupfs. Re-run. Issue goes away.

But this changes the entire shape of the device cgroup; devices.list is shorter too. So this alone doesn't 100% prove "deny+allow gap is the culprit" — needs further exclusion.

Experiment 2: strip devices in the runc hook

Wrap runc with a hook that empties the devices field:

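#!/bin/bash
# Wrapper installed in place of runc; the real binary is renamed runc.original.
data=$(cat /dev/stdin)
if [[ -n $data ]]; then
    # A resources payload arrived on stdin: empty its devices list before forwarding.
    newdata=$(jq -c '.devices=[]' <<< "$data")
    runc.original --debug --log /tmp/hook-runc.log "$@" <<< "$newdata"
else
    # Nothing on stdin: pass the call through unchanged.
    runc.original "$@"
fi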

Replace /bin/runc with this. Issue persists. That means the OCI default deny-all-then-allow-rwm rule isn't the culprit either.

Experiment 3: log runc's actual dbus properties

logrus.Debugf("set cgroup properties => %v", properties)

Observe what runc passes to systemd: the properties sent over dbus contain no device-related fields. So runc is not pushing device rules into dbus on update. Instead, systemd, on receiving any Set call, sees the existing DeviceAllow= (from the earlier create call) and rewrites the whole list anyway.

That matches the systemd code: systemd doesn't care what specifically changed; if DeviceAllow exists, it runs deny+allow.

Fix

Workload-side

Direct: replace std::random_device with a one-time seeded PRNG. Intrusive, unfriendly to users.
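
The idea, sketched in Python for brevity (the service itself is C++, where the equivalent is seeding one std::mt19937 from std::random_device at startup and reusing it). RequestRng is a name invented here. Note that a PRNG seeded this way is not suitable for cryptographic use.

```python
import os
import random


class RequestRng:
    """Per-process generator that reads the kernel entropy source exactly once."""

    def __init__(self, seed_bytes=None):
        # One read at startup: a failure here is loud and early, not mid-request.
        if seed_bytes is None:
            seed_bytes = os.urandom(16)
        self._rng = random.Random(int.from_bytes(seed_bytes, "big"))

    def next_u32(self):
        # Served by the in-process PRNG: no device access on the request path.
        return self._rng.getrandbits(32)


if __name__ == "__main__":
    rng = RequestRng()  # build once at service start, share across requests
    print([rng.next_u32() for _ in range(3)])
```

After construction, the request path never touches the device again, so a transient EPERM window has nothing to hit.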

runc-side

The best fix: runc must not continue the update when freeze fails. Merged upstream — if freeze fails AND the update includes device rules, runc bails and lets the caller decide (containerd can retry, degrade, or alert) instead of silently breaking the workload.
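
The policy fits in a few lines. This is our own sketch of the behavior described above, not the upstream patch: apply_update and its three callables are hypothetical stand-ins for the runtime's freeze, set-properties, and thaw operations.

```python
class FreezeFailed(Exception):
    """Raised instead of rewriting device rules on a container that is still running."""


def apply_update(resources, freeze, set_properties, thaw):
    """Apply a resource update, refusing device-rule changes if freeze fails.

    `freeze`, `set_properties` and `thaw` are caller-supplied callables.
    """
    try:
        try:
            freeze()
        except Exception as exc:
            if resources.get("devices"):
                # Device rules would be rewritten under a live workload: bail out.
                raise FreezeFailed(f"refusing device update: {exc}") from exc
            # No device rules in this update, so continuing unfrozen is harmless.
        set_properties(resources)
    finally:
        # Always request a thaw so a half-finished freeze is rolled back.
        thaw()


if __name__ == "__main__":
    def failing_freeze():
        raise RuntimeError("unable to freeze")

    try:
        apply_update({"devices": [{"allow": True, "access": "rwm"}]},
                     failing_freeze, print, lambda: print("thaw"))
    except FreezeFailed as exc:
        print("update rejected:", exc)
```

The caller sees an error it can act on, and the workload never sees the deny-all state.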

Workload-side (indirect)

For freezer to actually work, avoid large piles of D-state processes — optimize uninterruptible I/O and reduce process count so the protection mechanism activates.

Reflections

1. Same image, different runtime, completely different behavior. From the workload's view, "containerd vs docker" shouldn't matter. But once you go down into cgroup driver / runc / systemd, the differences are enormous. Many "implicit conventions" from the docker era broke after the move to containerd.

2. A "harmless" freeze failure isn't. runc's comment is crystal clear that freeze is required and device updates are disruptive. But on failure it only logs Info. From an SRE viewpoint, "protection mechanism failed; continue anyway" is a dangerous design. The right behavior: failure must abort the update.

3. Old systemd is a minefield. v219 with cgroupv1 + runc has a long list of subtle problems (cpuset unsupported, devices fully replayed, version-check branches everywhere). If production can't upgrade systemd, be aware.

4. High-frequency device access is a "workload habit" amplifier. std::random_device isn't a problem for most workloads. But for an ad-retrieval service constructing a fresh one per request, it became a perfect amplifier. When your code touches a resource every millisecond, any momentary unavailability becomes a guaranteed bug.

5. At the CRI boundary, nobody owns the fix. containerd passes the full Resources to runc, which is reasonable: it shouldn't assume the caller only wants CPU/Memory. But runc's device-update path under the systemd cgroup driver is unsafe. The bug is in the runc-systemd chain, yet only containerd users can hit it. The hardest production bugs are like this: no single component is wrong, the combination is.

Post-mortem of a production core-dump incident, March 2023. Internal hostnames, node names, and image names redacted.

Written by Zoe