PLEG NotReady on Kubelet: from one-off fixes to a systemic cure
PLEG NotReady has been one of the top stability issues on our clusters for the past couple of years. Bulk NotReady events, node evictions, short-task restarts — multiple major incidents per year, each affecting dozens of workloads for 30+ minutes. We'd already published two posts narrowly fixing it — debugging PLEG NotReady for short-task workloads and containerd init container stuck in Running. Both were single-point fixes. This post is the systemic one — full path from root cause to production rollout.
Why a "systemic cure"
Our clusters had hit bulk NotReady incidents (>10 nodes, 3+ events) multiple times — dozens of workloads affected, 30+ minutes per event. Different proximate causes each time — heavy I/O, dbus timeout, init container hanging on exit — but the same underlying path: oversized or excessive Status-lock critical sections in containerd's CRI plugin.
The two previous posts each fixed one symptom:
- Short-task PLEG NotReady investigation: localized to slow umount2 + Status lock contention. Fix: short-circuit Update for exiting containers.
- containerd container-rwlayer disk split: split metadata I/O from business I/O on a different I/O hotspot.
Both are narrow fixes — one for short-task exits, one for metadata I/O. As cluster scale grew, new PLEG NotReady incidents kept appearing in new shapes — e.g. task.Update on a Running container blocked because of a slow dbus call.
So this round's goal: walk every long call inside the Status lock and build a systematic remediation path — not "fix-as-you-find-it."
What PLEG NotReady really is
How PLEG works
PLEG (Pod Lifecycle Event Generator) is a Kubelet subsystem:
- Periodically (~1s default, max 3 min) pulls the latest container list from the container runtime.
- Diffs against the local cache, generates lifecycle events → pushes to a channel.
- The main Kubelet loop consumes events for scheduling and status sync.
The hard threshold for NotReady: if PLEG hasn't completed a single relist within 3 minutes, kubelet declares the node unhealthy, reports to the API server, and the node is marked NotReady.
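The threshold check can be pictured with a small sketch. This is not kubelet's code: relistTracker, markRelistDone, and healthy are hypothetical names, and treating "never relisted" as unhealthy is an assumption made for the sketch.

```go
package main

import (
	"sync"
	"time"
)

// relistThreshold mirrors the 3-minute limit described above.
const relistThreshold = 3 * time.Minute

// relistTracker is a hypothetical stand-in for PLEG's bookkeeping:
// it only remembers when the last relist finished.
type relistTracker struct {
	mu         sync.Mutex
	lastRelist time.Time
}

// markRelistDone records that a relist completed at time now.
func (t *relistTracker) markRelistDone(now time.Time) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.lastRelist = now
}

// healthy reports whether a relist completed within the threshold.
// A tracker that has never completed a relist is treated as unhealthy
// (an assumption of this sketch).
func (t *relistTracker) healthy(now time.Time) bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	if t.lastRelist.IsZero() {
		return false
	}
	return now.Sub(t.lastRelist) <= relistThreshold
}
```

The point of the sketch is that the check depends only on when a relist last finished, so a relist that is stuck (rather than failing) is enough to trip it.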
Where relist slows down
The relist path:
Kubelet
└─► CRI ListContainers
    └─► iterate containers
        └─► Container.Status.Get()   ← takes read lock
            └─► blocks if anyone holds write lock
If any container's Status write lock is held for long, ListContainers can't return — relist times out.
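A toy model of that blocking behavior, assuming each container's status sits behind its own sync.RWMutex. The containerStatus type and firstBlocked helper are hypothetical and are not part of containerd. TryRLock (Go 1.18+) is used only so the sketch can report the blocked container instead of hanging on it.

```go
package main

import "sync"

// containerStatus is a toy stand-in for a per-container status store
// guarded by a read/write lock.
type containerStatus struct {
	id    string
	mu    sync.RWMutex
	state string
}

// get reads the state under the read lock, as a list call would.
func (c *containerStatus) get() string {
	c.mu.RLock()
	defer c.mu.RUnlock()
	return c.state
}

// firstBlocked returns the ID of the first container whose status cannot
// be read right now because a writer holds its lock. A real list call
// would block at that container instead of reporting it.
func firstBlocked(cs []*containerStatus) (string, bool) {
	for _, c := range cs {
		if !c.mu.TryRLock() { // TryRLock requires Go 1.18 or newer
			return c.id, true
		}
		c.mu.RUnlock()
	}
	return "", false
}
```

One held write lock is enough to stall the whole iteration, even when every other container is readable.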
Who holds Status write lock for long
Walking containerd's CRI plugin code, all the long-held Status write locks fall into two buckets:
- UpdateSync paths (4 sites): sync-write to disk. Latency tied to disk I/O.
- Update paths (6 sites, 1 with a long critical section): call shim's task.Update RPC + containerd spec update inside the lock. Latency tied to shim health.
Documented slow sources from real incidents:
| Slow source | Bucket | Trigger |
|---|---|---|
| umount2(rootfs) blocking sync | Update | Container exit + high I/O + many dirty pages |
| Slow metadata writes | UpdateSync | System disk I/O pressure |
| task.Update dbus timeout | Update | Update on running container with systemd cgroup driver |
Each is enough to block ListContainers for 3+ minutes.
The systemic fix
Two main tracks for the container-exit path, plus a backstop (Track C) for slow metadata writes.
Track A: overlayfs volatile mount — eliminate umount sync at the source
Idea: make overlayfs skip the sync on umount. Cuts the "container exit → slow umount" chain at its root.
- Kernel requirement: kernel ≥ 5.x (upstream overlayfs gained the volatile option in 5.10), or a backported 4.18 with the patch.
- containerd requirement: 1.6.24 supports the volatile mount option (containerd/containerd#8676).
- Rollout: upgrade containerd + enable volatile in config.
- Limitation: only affects new containers. Pre-existing containers aren't covered.
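Because only new containers get the option, it can help to check which overlay mounts on a node actually carry it. The helper below is a minimal sketch, assuming input lines in /proc/mounts format and assuming the kernel lists volatile among the mount options. overlayIsVolatile is a hypothetical name, not a containerd or kubelet API.

```go
package main

import "strings"

// overlayIsVolatile reports whether one line in /proc/mounts format
// describes an overlay mount that carries the "volatile" option.
// Expected fields: device, mount point, fs type, options, dump, pass.
func overlayIsVolatile(mountLine string) bool {
	fields := strings.Fields(mountLine)
	if len(fields) < 4 || fields[2] != "overlay" {
		return false
	}
	// Options are comma-separated; match the whole token so that a path
	// such as upperdir=/volatile does not count.
	for _, opt := range strings.Split(fields[3], ",") {
		if opt == "volatile" {
			return true
		}
	}
	return false
}
```

Running this over every line of /proc/mounts on a canary node would give a rough count of containers still on non-volatile mounts.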
Track B: skip UpdateContainerResources for exiting containers — eliminate long RPC inside the lock
Idea: add a Stopping flag to Status. Set it the moment a container starts exiting. The update_resource path checks the flag and bails immediately without calling shim.
Core diff (simplified):
// Container resource update: skip exiting and removing containers.
// Simplified: other parameters are elided, and id is the container ID
// taken from them in the full function.
func (c *criService) updateContainerResources(ctx context.Context,
	status containerstore.Status) (retErr error) {
	if status.Removing || status.Stopping {
		return fmt.Errorf("container %q is in removing or stopping state", id)
	}
	// ...
}

// Container exit handler: mark Stopping immediately
func handleContainerExit(ctx context.Context, e *eventtypes.TaskExit, cntr containerstore.Container) error {
	// First we need to update the status to Stopping
	// to avoid container update from update_resource
	_ = cntr.Status.Update(func(status containerstore.Status) (containerstore.Status, error) {
		status.Stopping = true
		return status, nil
	})
	// ... task.Delete + UpdateSync ...
}

// New field on Status
type Status struct {
	// ...
	Stopping bool `json:"-"` // not persisted; cleared on restart
}
Key points:
Stoppingis in-memory only (json:"-") — naturally cleared on containerd restart.- Set exactly once on the exit event; cleared when the container object is garbage collected after
UpdateSync. - Zero impact on healthy containers — small diff, low risk.
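For readers who want to poke at the idea outside containerd, here is a self-contained toy model of the Stopping flag. Every name in it (status, statusStore, updateResources, markStopping, errExiting) is hypothetical. It only imitates the shape of the simplified diff and is not the patched containerd code.

```go
package main

import (
	"errors"
	"sync"
)

// status is a toy version of the container status; only the fields
// needed for this sketch are modeled.
type status struct {
	Removing bool
	Stopping bool
}

// statusStore guards a status with a read/write lock.
type statusStore struct {
	mu sync.RWMutex
	s  status
}

// update runs fn while holding the write lock, so anything slow inside
// fn blocks every reader for its full duration.
func (st *statusStore) update(fn func(status) (status, error)) error {
	st.mu.Lock()
	defer st.mu.Unlock()
	next, err := fn(st.s)
	if err != nil {
		return err
	}
	st.s = next
	return nil
}

var errExiting = errors.New("container is in removing or stopping state")

// updateResources calls the (possibly slow) shim function only when the
// container is neither removing nor stopping.
func updateResources(st *statusStore, callShim func() error) error {
	return st.update(func(s status) (status, error) {
		if s.Removing || s.Stopping {
			return s, errExiting // bail out before the slow call
		}
		return s, callShim()
	})
}

// markStopping is what the exit handler would do first.
func markStopping(st *statusStore) error {
	return st.update(func(s status) (status, error) {
		s.Stopping = true
		return s, nil
	})
}
```

Once markStopping has run, updateResources returns errExiting without ever invoking callShim, so the write lock is held only for the flag check.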
Trade-offs
| Dimension | Track A (volatile) | Track B (Stopping flag) |
|---|---|---|
| Rollout difficulty | High (needs kernel ≥ 4.18+patch / 5.x AND containerd upgrade) | Low (containerd upgrade + restart) |
| Affects existing containers | No | Yes |
| Fully eliminates umount sync | Yes | No (just keeps it out of the lock) |
| Risk | Kernel + containerd dual dependency | containerd-only, small diff |
Decision: run both in parallel. Track B ships first (stops the bleeding). Track A follows (eliminates the root).
Track C (backstop): split CRI status onto its own disk
For the UpdateSync line (slow metadata I/O), split CRI status data onto its own disk:
root_dir = "/run/containerd/io.containerd.grpc.v1.cri"

# Move existing status to tmpfs at /run (rsync first to avoid losing data on restart)
rsync -rc /media/disk1/containerd/io.containerd.grpc.v1.cri/ \
    /run/containerd/io.containerd.grpc.v1.cri
Trade-off: tmpfs is fast but uses RAM. If memory is tight, mount a dedicated SSD partition for status instead.
Production rollout
Canary
Track B canary as 5+1 A/B:
- 5 nodes upgraded to the patched containerd (treatment).
- 1 node left unchanged (control).
- Same workloads, same resource pool.
- Watch for one month: PLEG NotReady frequency, relist P99, UpdateContainerResources failure rate.
Results:
- Treatment group: no PLEG NotReady events.
- Control group: ongoing NotReady (~1–2 per week).
- Workloads: no regressions.
Full rollout + rollback plan
RPM-based rollout, RPM-based rollback. The core logic:
# Forward: upgrade to new version
rpm -Uvh containerd.io-<new>.rpm
# Rollback: downgrade (must use --oldpackage)
rpm -Uvh --oldpackage containerd.io-<old>.rpm
Key points:
- Package files hosted on internal storage; downloads wrapped in a script.
- Rollback must use --oldpackage, or RPM will refuse the downgrade.
- Upgrade doesn't restart containers. Only the containerd service itself is restarted (after a daemon-reload) to pick up the new binary.
- After full rollout: PLEG NotReady went from ~30 events/day to 0/day. Production pool incidents dropped significantly.
Aftermath: dbus timeouts
After Track B was fully rolled out, we hit a new flavor: task.Update on Running containers also times out. This time it's not slow umount — it's task.Update going through systemd dbus under the systemd cgroup driver, and dbus itself stalling.
Short-term fix: reinstall polkit and restart dbus. But this exposed a new problem — calling dbus inside the Status lock is itself a landmine. The community already has discussions on this; it's the next item on the remediation roadmap.
Results
Quantified outcomes:
- PLEG NotReady frequency: 30/day → 0/day.
- Production pool incidents: 2 to 3 fewer per year.
- Short-task high-priority pool: 1 fewer eviction event per day, avoiding ~80 short-task restarts/day.
- Workload impact: from "dozens of teams affected" → "zero" (via this path).
Looking back
PLEG NotReady is fundamentally infrastructure-level long-tail jitter amplified by an upper-layer state machine:
- A slow umount, a stuck dbus call, a disk hiccup — individually, none of them is a big deal.
- The serial bottleneck of the Status lock turns "one slow container" into "the whole node is stuck."
- The PLEG 3-minute hard threshold then turns that into NotReady → eviction → user-visible outage.
So curing this class of problem isn't about fixing each slow source individually. It comes down to three rules:
- Dismantle the serial bottleneck: the Stopping flag in this post takes exiting containers off the Update path.
- Reduce dependencies behind the hard threshold: volatile mount + metadata-disk split keep ListContainers away from business I/O.
- Don't call external RPCs inside the lock: the dbus issue is the cautionary tale. Top item for the next phase.
The next phase: audit every "call external RPC inside a lock" site in containerd's CRI plugin. That's the next chapter of this remediation.
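One common shape for keeping a slow external call out of a critical section is snapshot, call, then commit. The sketch below shows that general pattern under stated assumptions. It is not containerd's code and not the fix the community is discussing; record and applyLimit are hypothetical names.

```go
package main

import "sync"

// record is a toy piece of state guarded by a read/write lock.
type record struct {
	mu      sync.RWMutex
	version int
	limit   int
}

// applyLimit follows a snapshot / call / commit shape:
//  1. read what the slow call needs under the read lock,
//  2. run the slow call with no lock held,
//  3. take the write lock only to commit, and skip the commit if the
//     record changed in the meantime.
func applyLimit(r *record, limit int, slowCall func(limit int) error) (bool, error) {
	r.mu.RLock()
	seen := r.version
	r.mu.RUnlock()

	if err := slowCall(limit); err != nil { // no lock is held here
		return false, err
	}

	r.mu.Lock()
	defer r.mu.Unlock()
	if r.version != seen {
		return false, nil // the record changed underneath us; caller may retry
	}
	r.limit = limit
	r.version++
	return true, nil
}
```

The trade-off is that the external side may already have been changed when the commit is skipped, so a real implementation would need a retry or reconciliation step. The benefit is that readers are never blocked for the duration of slowCall.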
Related
- Short-task PLEG NotReady investigation — single-point investigation
- containerd init container stuck in Running — another way the same slow-umount chain can fail
- containerd PR #8676 — overlay volatile mount
- Kubelet PLEG design doc

Written by
Zoe
AI Infra Engineer · LLM Serving · GPU/RDMA · indie hacker, obsessed with shipping tools