PLEG NotReady on Kubelet: from one-off fixes to a systemic cure

PLEG NotReady has been one of the top stability issues on our clusters for the past couple of years: bulk NotReady events, node evictions, and short-task restarts, with multiple major incidents per year, each affecting dozens of workloads for 30+ minutes. We had already published two posts that each fixed one narrow case: "debugging PLEG NotReady for short-task workloads" and "containerd init container stuck in Running". Both were single-point fixes. This post is the systemic one, covering the full path from root cause to production rollout.

Why a "systemic cure"

Our clusters have hit bulk NotReady incidents (more than 10 nodes at a time) at least three times, each affecting dozens of workloads for 30+ minutes. The proximate cause differed each time (heavy I/O, a dbus timeout, an init container hanging on exit), but the underlying path was the same: Status-lock critical sections in containerd's CRI plugin that are too long or too numerous.

The two previous posts each fixed one symptom: one addressed short-task exits, the other metadata I/O. Both are narrow fixes. As cluster scale grew, new PLEG NotReady incidents kept appearing in new shapes, for example task.Update on a Running container blocking on a slow dbus call.

So this round's goal: walk every long call inside the Status lock and build a systematic remediation path — not "fix-as-you-find-it."

What PLEG NotReady really is

How PLEG works

PLEG (Pod Lifecycle Event Generator) is a Kubelet subsystem:

  • Periodically (every ~1s by default; a single relist may take up to 3 min before the node is considered unhealthy) pulls the latest container list from the container runtime.
  • Diffs against the local cache, generates lifecycle events → pushes to a channel.
  • The main Kubelet loop consumes events for scheduling and status sync.

The hard threshold for NotReady: if PLEG hasn't completed a single relist within 3 minutes, kubelet declares the node unhealthy, reports to the API server, and the node is marked NotReady.

Where relist slows down

The relist path:

Kubelet
  └─► CRI ListContainers
        └─► iterate containers
              └─► Container.Status.Get()    ← takes read lock
                    └─► blocks if anyone holds write lock

If any container's Status write lock is held for long, ListContainers can't return — relist times out.

Who holds Status write lock for long

Walking containerd's CRI plugin code, all the long-held Status write locks fall into two buckets:

  1. UpdateSync paths (4 sites): sync-write to disk. Latency tied to disk I/O.
  2. Update paths (6 sites, 1 with a long critical section): call shim's task.Update RPC + containerd spec update inside the lock. Latency tied to shim health.

Documented slow sources from real incidents:

| Slow source | Bucket | Trigger |
| --- | --- | --- |
| umount2(rootfs) blocking sync | Update | Container exit + high I/O + many dirty pages |
| Slow metadata writes | UpdateSync | System disk I/O pressure |
| task.Update dbus timeout | Update | Update on running container with systemd cgroup driver |
Each is enough to block ListContainers for 3+ minutes.

The systemic fix

Two main tracks, both aimed at the exit-time umount path, plus a backstop (Track C) for slow metadata writes.

Track A: overlayfs volatile mount — eliminate umount sync at the source

Idea: make overlayfs skip the sync on umount. Cuts the "container exit → slow umount" chain at its root.

  • Kernel requirement: kernel ≥ 5.10 (where the overlayfs volatile option landed upstream), or a 4.18 kernel with the patch backported.
  • containerd requirement: 1.6.24 supports the volatile mount option (containerd/containerd#8676).
  • Rollout: upgrade containerd + enable volatile in config.
  • Limitation: only affects new containers. Pre-existing containers aren't covered.
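At the mount level, "volatile" is just one more entry in the overlayfs option string. The sketch below shows how such a string is composed; overlayMountData and the example paths are hypothetical and are not taken from containerd's snapshotter.

```go
package main

import (
	"fmt"
	"strings"
)

// overlayMountData builds the option string passed to an overlayfs mount.
// When volatile is true it appends the "volatile" option, which tells the
// kernel to skip syncing the upper layer (on kernels that support it).
func overlayMountData(lower []string, upper, work string, volatile bool) string {
	opts := []string{
		"lowerdir=" + strings.Join(lower, ":"),
		"upperdir=" + upper,
		"workdir=" + work,
	}
	if volatile {
		opts = append(opts, "volatile")
	}
	return strings.Join(opts, ",")
}

func main() {
	// Hypothetical layer paths, for illustration only.
	fmt.Println(overlayMountData([]string{"/l2", "/l1"}, "/u", "/w", true))
	// prints "lowerdir=/l2:/l1,upperdir=/u,workdir=/w,volatile"
}
```

The option is fixed at mount time, which is why it only helps containers created after the change. The cost is that data in the upper layer is not guaranteed to survive a crash, which is generally acceptable for a container rootfs that is discarded on exit anyway.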

Track B: skip UpdateContainerResources for exiting containers — eliminate long RPC inside the lock

Idea: add a Stopping flag to Status. Set it the moment a container starts exiting. The update_resource path checks the flag and bails immediately without calling shim.

Core diff (simplified):

// Container resource update: skip exiting and removing containers
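func (c *criService) updateContainerResources(ctx context.Context,
    cntr containerstore.Container,
    status containerstore.Status) (retErr error) {

    id := cntr.ID
    // Bail out before the slow path: no shim call for exiting containers.
    if status.Removing || status.Stopping {
        return fmt.Errorf("container %q is in removing or stopping state", id)
    }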
    // ...
}

// Container exit handler: mark Stopping immediately
func handleContainerExit(ctx context.Context, e *eventtypes.TaskExit, cntr containerstore.Container) error {
    // First we need to update the status to Stopping
    // to avoid container update from update_resource
    _ = cntr.Status.Update(func(status containerstore.Status) (containerstore.Status, error) {
        status.Stopping = true
        return status, nil
    })
    // ... task.Delete + UpdateSync ...
}

// New field on Status
type Status struct {
    // ...
    Stopping bool `json:"-"`   // not persisted; cleared on restart
}

Key points:

  • Stopping is in-memory only (json:"-") — naturally cleared on containerd restart.
  • Set exactly once on the exit event; cleared when the container object is garbage collected after UpdateSync.
  • Zero impact on healthy containers — small diff, low risk.

Trade-offs

| Dimension | Track A (volatile) | Track B (Stopping flag) |
| --- | --- | --- |
| Rollout difficulty | High (needs kernel ≥ 4.18+patch / 5.x AND containerd upgrade) | Low (containerd upgrade + restart) |
| Affects existing containers | No | Yes |
| Fully eliminates umount sync | Yes | No (just keeps it out of the lock) |
| Risk | Kernel + containerd dual dependency | containerd-only, small diff |

Decision: run both in parallel. Track B ships first (stops the bleeding). Track A follows (eliminates the root).

Track C (backstop): split CRI status onto its own disk

For the UpdateSync bucket (slow metadata I/O), move the CRI status data onto its own disk:

root_dir = "/run/containerd/io.containerd.grpc.v1.cri"

# Move existing status to tmpfs at /run (rsync first to avoid losing data on restart)
rsync -rc /media/disk1/containerd/io.containerd.grpc.v1.cri/ \
          /run/containerd/io.containerd.grpc.v1.cri

Trade-off: tmpfs is fast but uses RAM. If memory is tight, mount a dedicated SSD partition for status instead.

Production rollout

Canary

Track B canary as 5+1 A/B:

  • 5 nodes upgraded to the patched containerd (treatment).
  • 1 node left unchanged (control).
  • Same workloads, same resource pool.
  • Watch for one month: PLEG NotReady frequency, relist P99, UpdateContainerResources failure rate.

Results:

  • Treatment group: no PLEG NotReady events.
  • Control group: ongoing NotReady (~1–2 per week).
  • Workloads: no regressions.

Full rollout + rollback plan

RPM-based rollout, RPM-based rollback. The core logic:

# Forward: upgrade to new version
rpm -Uvh containerd.io-<new>.rpm

# Rollback: downgrade (must use --oldpackage)
rpm -Uvh --oldpackage containerd.io-<old>.rpm

Key points:

  1. Package files hosted on internal storage; downloads wrapped in a script.
  2. Rollback must use --oldpackage, or RPM will refuse the downgrade.
  3. Upgrade doesn't restart containers: only the containerd daemon is restarted to pick up the new binary, and running containers stay up under their shims.
  4. After full rollout: PLEG NotReady went from ~30 events/day to 0/day. Production pool incidents dropped significantly.

Aftermath: dbus timeouts

After Track B was fully rolled out, we hit a new flavor: task.Update on Running containers also times out. This time it's not slow umount — it's task.Update going through systemd dbus under the systemd cgroup driver, and dbus itself stalling.

Short-term fix: reinstall polkit and restart dbus. But this exposed a new problem — calling dbus inside the Status lock is itself a landmine. The community already has discussions on this; it's the next item on the remediation roadmap.

Results

Quantified outcomes:

  • PLEG NotReady frequency: 30/day → 0/day.
  • Production pool incidents: 2 to 3 fewer per year.
  • Short-task high-priority pool: about 1 fewer eviction event per day, avoiding ~80 short-task restarts per day.
  • Workload impact: from "dozens of teams affected" to zero via this path.

Looking back

PLEG NotReady is fundamentally infrastructure-level long-tail jitter amplified by an upper-layer state machine:

  • A slow umount, a stuck dbus call, a disk hiccup — individually, none of them is a big deal.
  • The serial bottleneck of the Status lock turns "one slow container" into "the whole node is stuck."
  • The PLEG 3-minute hard threshold then turns that into NotReady → eviction → user-visible outage.

So curing this class of problem isn't about fixing each slow source individually. It's about:

  1. Dismantle the serial bottleneck — the Stopping flag in this post takes exiting containers off the Update path.
  2. Reduce dependencies behind the hard threshold — volatile mount + metadata-disk split keep ListContainers away from business I/O.
  3. Don't call external RPCs inside the lock — the dbus issue is the cautionary tale. Top item for the next phase.

The next phase: audit every "call external RPC inside a lock" site in containerd's CRI plugin. That's the next chapter of this remediation.

Written by

Zoe

AI Infra Engineer · LLM Serving · GPU/RDMA · indie hacker, obsessed with shipping tools