PLEG NotReady on Kubelet: from one-off fixes to a systemic cure

PLEG NotReady has been one of the top stability issues on our clusters for the past couple of years. Bulk NotReady events, node evictions, short-task restarts — multiple major incidents per year, each affecting dozens of workloads for 30+ minutes. We'd already published two posts narrowly fixing it — debugging PLEG NotReady for short-task workloads and containerd init container stuck in Running. Both were single-point fixes. This post is the systemic one — full path from root cause to production rollout.

Why a "systemic cure"

Our clusters had hit bulk NotReady incidents (>10 nodes, 3+ events) multiple times — dozens of workloads affected, 30+ minutes per event. Different proximate causes each time — heavy I/O, dbus timeout, init container hanging on exit — but the same underlying path: oversized or excessive Status-lock critical sections in containerd's CRI plugin.

The two previous posts each fixed one symptom:

Short-task PLEG NotReady investigation: localized to slow umount2 + Status lock contention. Fix: short-circuit Update for exiting containers.
containerd container-rwlayer disk split: localized to a different I/O hotspot. Fix: split metadata I/O from business I/O.

Both are narrow fixes — one for short-task exits, one for metadata I/O. As cluster scale grew, new PLEG NotReady incidents kept appearing in new shapes — e.g. task.Update on a Running container blocked because of a slow dbus call.

So this round's goal: walk every long call inside the Status lock and build a systematic remediation path — not "fix-as-you-find-it."

What PLEG NotReady really is

How PLEG works

PLEG (Pod Lifecycle Event Generator) is a Kubelet subsystem:

Periodically (every ~1s by default) pulls the latest container list from the container runtime. This pass is the relist, and it has up to 3 min to complete.
Diffs against the local cache, generates lifecycle events → pushes to a channel.
The main Kubelet loop consumes events for scheduling and status sync.

The hard threshold for NotReady: if PLEG hasn't completed a single relist within 3 minutes, kubelet declares the node unhealthy, reports to the API server, and the node is marked NotReady.

Where relist slows down

The relist path:

Kubelet
  └─► CRI ListContainers
        └─► iterate containers
              └─► Container.Status.Get()    ← takes read lock
                    └─► blocks if anyone holds write lock

If any container's Status write lock is held for long, ListContainers can't return — relist times out.

Who holds `Status` write lock for long

Walking containerd's CRI plugin code, all the long-held Status write locks fall into two buckets:

UpdateSync paths (4 sites): sync-write to disk. Latency tied to disk I/O.
Update paths (6 sites, 1 with a long critical section): call shim's task.Update RPC + containerd spec update inside the lock. Latency tied to shim health.

Documented slow sources from real incidents:

| Slow source | Bucket | Trigger |
| --- | --- | --- |
| `umount2(rootfs)` blocking sync | Update | Container exit + high I/O + many dirty pages |
| Slow metadata writes | UpdateSync | System disk I/O pressure |
| `task.Update` dbus timeout | Update | Update on running container with systemd cgroup driver |

Each is enough to block ListContainers for 3+ minutes.

The systemic fix

Two main tracks, both targeting the container-exit path, plus a backstop (Track C) for slow metadata writes.

Track A: overlayfs `volatile` mount — eliminate umount sync at the source

Idea: make overlayfs skip the sync on umount. Cuts the "container exit → slow umount" chain at its root.

Kernel requirement: kernel ≥ 5.10 (where the overlayfs volatile option landed upstream), or a 4.18 kernel with the patch backported.
containerd requirement: 1.6.24 supports the volatile mount option (containerd/containerd#8676).
Rollout: upgrade containerd + enable volatile in config.
Limitation: only affects new containers. Pre-existing containers aren't covered.

Track B: skip `UpdateContainerResources` for exiting containers — eliminate long RPC inside the lock

Idea: add a Stopping flag to Status. Set it the moment a container starts exiting. The update_resource path checks the flag and bails immediately without calling the shim.

Core diff (simplified):

// Container resource update: skip containers that are being removed or are exiting
func (c *criService) updateContainerResources(ctx context.Context,
    cntr containerstore.Container,
    status containerstore.Status) (retErr error) {

    id := cntr.ID
    if status.Removing || status.Stopping {
        return fmt.Errorf("container %q is in removing or stopping state", id)
    }
    // ...
}

// Container exit handler: mark Stopping immediately
func handleContainerExit(ctx context.Context, e *eventtypes.TaskExit, cntr containerstore.Container) error {
    // Mark the container as Stopping first, so that the
    // update_resource path skips it from here on.
    _ = cntr.Status.Update(func(status containerstore.Status) (containerstore.Status, error) {
        status.Stopping = true
        return status, nil
    })
    // ... task.Delete + UpdateSync ...
}

// New field on Status
type Status struct {
    // ...
    Stopping bool `json:"-"`   // not persisted; cleared on restart
}

Key points:

Stopping is in-memory only (json:"-") — naturally cleared on containerd restart.
Set exactly once on the exit event; cleared when the container object is garbage collected after UpdateSync.
Zero impact on healthy containers — small diff, low risk.

Trade-offs

| Dimension | Track A (volatile) | Track B (Stopping flag) |
| --- | --- | --- |
| Rollout difficulty | High (needs kernel ≥ 4.18+patch / 5.x AND containerd upgrade) | Low (containerd upgrade + restart) |
| Affects existing containers | No | Yes |
| Fully eliminates umount sync | Yes | No (just keeps it out of the lock) |
| Risk | Kernel + containerd dual dependency | containerd-only, small diff |

Decision: run both in parallel. Track B ships first (stops the bleeding). Track A follows (eliminates the root).

Track C (backstop): split CRI status onto its own disk

For the UpdateSync bucket (slow metadata I/O), move the CRI status data onto its own disk:

root_dir = "/run/containerd/io.containerd.grpc.v1.cri"

# Move existing status to tmpfs at /run (rsync first to avoid losing data on restart)
rsync -rc /media/disk1/containerd/io.containerd.grpc.v1.cri/ \
          /run/containerd/io.containerd.grpc.v1.cri

Trade-off: tmpfs is fast but uses RAM. If memory is tight, mount a dedicated SSD partition for status instead.

Production rollout

Canary

Track B canary as 5+1 A/B:

5 nodes upgraded to the patched containerd (treatment).
1 node left unchanged (control).
Same workloads, same resource pool.
Watch for one month: PLEG NotReady frequency, relist P99, UpdateContainerResources failure rate.

Results:

Treatment group: no PLEG NotReady events.
Control group: ongoing NotReady (~1–2 per week).
Workloads: no regressions.

Full rollout + rollback plan

RPM-based rollout, RPM-based rollback. The core logic:

# Forward: upgrade to new version
rpm -Uvh containerd.io-<new>.rpm

# Rollback: downgrade (must use --oldpackage)
rpm -Uvh --oldpackage containerd.io-<old>.rpm

Key points:

Package files hosted on internal storage; downloads wrapped in a script.
Rollback must use --oldpackage, or RPM will refuse the downgrade.
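The upgrade doesn't restart containers: only the containerd daemon is restarted (after a daemon-reload) to pick up the new binary, and running containers stay up under their shims.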
After full rollout: PLEG NotReady went from ~30 events/day to 0/day. Production pool incidents dropped significantly.

Aftermath: dbus timeouts

After Track B was fully rolled out, we hit a new flavor: task.Update on Running containers also times out. This time it's not slow umount — it's task.Update going through systemd dbus under the systemd cgroup driver, and dbus itself stalling.

Short-term fix: reinstall polkit and restart dbus. But this exposed a new problem — calling dbus inside the Status lock is itself a landmine. The community already has discussions on this; it's the next item on the remediation roadmap.

Results

Quantified outcomes:

PLEG NotReady frequency: 30/day → 0/day.
Production pool incidents: 2 to 3 fewer per year.
Short-task high-priority pool: 1 fewer eviction event per day, avoiding ~80 short-task restarts per day.
Workload impact: "dozens of teams affected" → zero (via this path).

Looking back

PLEG NotReady is fundamentally infrastructure-level long-tail jitter amplified by an upper-layer state machine:

A slow umount, a stuck dbus call, a disk hiccup — individually, none of them is a big deal.
The serial bottleneck of the Status lock turns "one slow container" into "the whole node is stuck."
The PLEG 3-minute hard threshold then turns that into NotReady → eviction → user-visible outage.

So curing this class of problem isn't about fixing each slow source individually. It's about:

Dismantle the serial bottleneck — the Stopping flag in this post takes exiting containers off the Update path.
Reduce dependencies behind the hard threshold — volatile mount + metadata-disk split keep ListContainers away from business I/O.
Don't call external RPCs inside the lock — the dbus issue is the cautionary tale. Top item for the next phase.

The next phase: audit every "call external RPC inside a lock" site in containerd's CRI plugin. That's the next chapter of this remediation.

Related

Short-task PLEG NotReady investigation: the single-point investigation
containerd init container stuck in Running: another way the same slow-umount chain can fail
containerd PR #8676: overlay volatile mount
Kubelet PLEG design doc
