PLEG NotReady on short-task workloads: an overlooked lock contention inside containerd

13 min read

PLEG NotReady on short-task workloads: an overlooked lock contention inside containerd

Pods in the short-task resource pool kept ending up in Unknown, getting evicted, and nodes intermittently going NotReady. What made this annoying is that we'd "solved" it before — a previous fix using a container rw-layer disk split had cleared the same PLEG NotReady symptom in another scenario. But that fix did nothing for short-task workloads.

Starting from "why doesn't the previous fix work here?", we dug in again. The actual root cause turned out to be on a completely different path: containerd's internal Status lock being held hostage by an exiting container, blocking kubelet's ListContainers.

TL;DR

The NotReady chain

  1. On container exit, the shim runs umount2 on the rootfs. Without volatile mount, overlayfs triggers a disk sync.
  2. Under I/O pressure or with many dirty pages, umount2 takes significantly longer (we measured minutes).
  3. Every 10 seconds, kubelet calls CRI UpdateContainerResources. containerd's implementation takes the Status write lock first, then calls Task Update.
  4. Shim is stuck in umount2task.Update doesn't respond — Status lock is held the entire time.
  5. Meanwhile kubelet's PLEG calls ListContainers, which iterates every container and calls Status.Get(). The exiting container's Status lock is held by Update — Get blocks.
  6. One stuck container stalls the entire ListContainers. Kubelet relist times out (> 3 min). PLEG is not healthy fires. Node → NotReady.

Visualized:

Container exit

  ├─► shim runs umount2(rootfs)         ── slow sync, no response

  ├─► kubelet calls ContainerUpdateResource
  │     └─► takes Status.Lock
  │           └─► task.Update (RPC to shim)  ── blocked on shim

  └─► kubelet calls ListContainers
        └─► iterate → Status.Get()        ── lock unattainable
              └─► ListContainers times out
                    └─► PLEG NotReady

Three fix paths

Matching the three causes:

  1. Add volatile mount option to overlayfs — no sync on umount. Needs containerd ≥ 1.6.24 (containerd/containerd#8676), kernel > 5.10 (or backported 4.18). Verified in production.
  2. Skip Task Update for exiting containers: stop letting useless updates hold the Status lock. Code-verified, with one side effect (covered below).
  3. Investigate why shim doesn't respond immediately: the shim's delete flow holds a process lock that blocks Update from acquiring it. Deeper fix; future work.

Walkthrough: discovery → instrumentation → reproduction → patch.

Background

The short-task pool team reported task instances frequently failing — Unknown state, evictions. Grafana showed rising node NotReady frequency.

We'd previously published a fix in another scenario ("10 nodes bulk PLEG NotReady"), splitting image layer I/O from status persistence I/O. That worked brilliantly there.

But after canarying that fix on the short-task pool, NotReady didn't improve materially. Two questions:

  • Did the disk split actually solve the original root cause, or just a symptom?
  • Are short-task NotReady events a different root cause?

The disk split assumed "status update I/O on system disk is the problem." But if status update I/O is no longer the bottleneck, can exit-time umount still hurt ListContainers via a different path? That's where this investigation started.

Analysis

Step 1: instrument first

Pick a reproducible node

Short-task high-priority pool reproduces well at single-node level. We picked a node with daily NotReady events, set up periodic log captures.

PLEG metrics

Before this we only tracked node-level NotReady duration, not PLEG itself. So first: add two metrics to the in-house single-node diagnostics dashboard:

  • NotReady: node NotReady windows
  • PLEG Cost: kubelet relist P99 latency

Interesting finding: during multiple PLEG NotReady windows, PLEG metrics themselves were missing (1 sample per 15s). The highest captured value was 145s, below the 180s threshold. Even our metrics pipeline was affected by the underlying issue. Tough, but at least gave us coarse time windows.

Step 2: add instrumentation to containerd

Kubelet-side logs aren't enough. Go has no strace equivalent, so the simplest path is: modify source, rebuild, deploy to debug node.

Key code paths

ListContainers (pkg/cri/server/container_list.go):

func (c *criService) ListContainers(ctx context.Context, r *runtime.ListContainersRequest) (*runtime.ListContainersResponse, error) {
    containersInStore := c.containerStore.List()
    var containers []*runtime.Container
    for _, container := range containersInStore {
        containers = append(containers, toCRIContainer(container))
    }
    containers = c.filterCRIContainers(containers, r.GetFilter())
    return &runtime.ListContainersResponse{Containers: containers}, nil
}

func toCRIContainer(container containerstore.Container) *runtime.Container {
    status := container.Status.Get()
    return &runtime.Container{ /* ... */ }
}

Two potentially-slow spots: containerStore.List() and Status.Get().

UpdateContainerResources:

func (c *criService) UpdateContainerResources(ctx context.Context, r *runtime.UpdateContainerResourcesRequest) (*runtime.UpdateContainerResourcesResponse, error) {
    container, err := c.containerStore.Get(r.GetContainerId())
    if err != nil {
        return nil, fmt.Errorf("failed to find container: %w", err)
    }
    if err := container.Status.Update(func(status containerstore.Status) (containerstore.Status, error) {
        return status, c.updateContainerResources(ctx, container, r, status)
    }); err != nil {
        return nil, fmt.Errorf("failed to update resources: %w", err)
    }
    return &runtime.UpdateContainerResourcesResponse{}, nil
}

container.Status.Updatetakes the lock and holds it for the duration of the closure. The closure calls updateContainerResources, which internally calls task.Update:

if err := task.Update(ctx, containerd.WithResources(getResources(newSpec))); err != nil {
    if errdefs.IsNotFound(err) { return nil }
    return fmt.Errorf("failed to update resources: %w", err)
}

task.Update RPCs to shim. If shim is blocked, the Status lock is held the whole time — every Status.Get() queues up. ListContainers stalls.

Add timing logs

// ListContainers: total + per-container Status.Get
func (c *criService) ListContainers(...) {
    start := time.Now()
    defer func() {
        logrus.Debugf("[ListContainers] all list containers cost %s", time.Now().Sub(start))
    }()
}

func toCRIContainer(container containerstore.Container) *runtime.Container {
    start := time.Now()
    status := container.Status.Get()
    logrus.Debugf("[ListContainers] get container %s status cost %s", container.ID, time.Now().Sub(start))
}

// UpdateContainerResources: lock-hold + taskUpdate timing
func (c *criService) UpdateContainerResources(...) {
    if err := container.Status.Update(func(status containerstore.Status) (containerstore.Status, error) {
        start := time.Now()
        defer func() {
            logrus.Debugf("[UpdateContainerResources] %s cost %s", r.GetContainerId(), time.Now().Sub(start))
        }()
        return status, c.updateContainerResources(ctx, container, r, status)
    }); err != nil {
        return nil, fmt.Errorf("failed to update resources: %w", err)
    }
}

func (c *criService) updateContainerResources(...) {
    start := time.Now()
    defer func() {
        logrus.Debugf("[updateContainerSpec.taskUpdate] %s cost %s", id, time.Now().Sub(start))
    }()
    if err := task.Update(ctx, containerd.WithResources(getResources(newSpec))); err != nil {
        // ...
    }
}

Build, canary, wait for NotReady reproduction.

Step 3: the smoking-gun logs

After a few days, NotReady reproduced:

T09:30:11.885 level=debug msg="failed to delete task" error="context deadline exceeded" id=40585f73...
T09:30:11.885 level=error msg="failed to handle container TaskExit event ... ID:40585f73 ...
                                ExitedAt:2023-11-08 01:30:01 ..." error="failed to delete task: context deadline exceeded"
T09:30:21.510 level=error msg="get state for 40585f73..." error="context deadline exceeded"
T09:33:43.226 level=debug msg="[updateContainerSpec.taskUpdate] 40585f73... cost 1m57.542685352s"
T09:33:43.499 level=debug msg="[UpdateContainerResources] 40585f73... cost 2m0.271565471s"
T09:33:43.499 level=error msg="UpdateContainerResources for \"40585f73...\" failed" error="context canceled"
T09:33:43.499 level=debug msg="[ListContainers] get container 40585f73... status cost 1m28.992474091s"

The numbers:

Stage Duration
ListContainers.Get 1m29s
GetContainerStatus 1m28s
updateContainerResources 2m0s
updateContainerSpec.taskUpdate 1m57s
everything else ~3s

Clear conclusion: taskUpdate ate basically all of updateContainerResources's time, and because the Status lock is held, ListContainers is dragged along.

Step 4: deliberate reproduction

To confirm causality, deliberately reproduce.

Test pod

Two requirements:

  • restartPolicy: Never — exit triggers Task Delete (sandbox unmount).
  • Pin to debug node.
apiVersion: v1
kind: Pod
metadata:
  name: debug-for-cntr
spec:
  restartPolicy: Never
  nodeName: debug-node
  containers:
  - name: container1
    image: internal-registry/base/loop:latest
    command: ['sh', '-c', 'while true; do date; sleep 1; done']

Get the IDs:

CNTR_PID=$(ps -ef | grep 'while true' | grep -v grep | awk '{print $2}')
SHIM_PID=$(ps -ef | grep 'while true' | grep -v grep | awk '{print $3}')
SANDBOX_ID=$(cat /proc/$SHIM_PID/cmdline | tr '\0' ' ' | awk '{print $5}')

Inject umount2 latency

I/O pressure essentially means umount2 doesn't return. Use strace to inject 60s latency directly on the shim's umount2 (strace ≥ v4.22):

strace -p $SHIM_PID -e trace=umount2 -f -e inject=umount2:delay_enter=60000000 -b execve

Trigger exit

kill -9 $CNTR_PID

Watch the strace terminal:

[pid 843099] --- SIGCHLD ... ---
[pid 843100] umount2("/run/containerd/io.containerd.runtime.v2.task/k8s.io/2346f52c.../rootfs", 0 <unfinished ...>
... blocks for 60s ...
[pid 843100] <... umount2 resumed>)     = 0

Successfully injected; umount2 stalls. Note that the shim doesn't exit normally here — there's a separate containerd-1.6 shim-leak issue, out of scope for this post.

containerd logs:

T17:34:41 [updateContainerSpec] 771d1a... cost 63.682694ms
T17:34:43 level=error msg="get state for 771d1a..." error="context deadline exceeded"
T17:34:50 level=debug msg="failed to delete task" error="context deadline exceeded" id=771d1a...
T17:36:40 [updateContainerSpec.taskUpdate] 771d1a... cost 1m56.983028134s
T17:36:40 [UpdateContainerResources] 771d1a... cost 1m59.106818442s
T17:36:40 [ListContainers] get container 771d1a... status cost 1m58.185192038s
T17:36:40 level=error msg="UpdateContainerResources for \"771d1a...\" failed" \
          error="failed to update resources: cannot update a deleted process"
T17:36:40 [ListContainers] get container 771d1a... status cost 1m2.269374691s
T17:36:40 [ListContainers] get container 771d1a... status cost 1m55.208983726s
T17:36:40 [ListContainers] get container 771d1a... status cost 1m57.267298304s
...
T17:36:40 handleContainerExit status.updateSync 771d1a... cost 1m50.40571539s

Full reproduction:

  • Multiple concurrent ListContainers all block on the same container, up to 1m58s (umount injection is 1 min; actual wait is longer due to RPC timeouts and retries).
  • UpdateContainerResources 1m59s.
  • updateContainerSpec.taskUpdate 1m56s.

Causality closed.

Why does task.Update get blocked on the shim side?

Shim v2 server lives in runtime/v2/shim/shim.go:

func run(...) error {
    server, err := newServer()
    for _, srv := range ttrpcServices {
        if err := srv.RegisterTTRPC(server); err != nil {
            return fmt.Errorf("failed to register service: %w", err)
        }
    }
    if err := serve(ctx, server, signals, sd.Shutdown); err != nil {
        if err != shutdown.ErrShutdown { return err }
    }
    // Note: if the shim exits abnormally, the address socket can leak
    if address, err := ReadAddress("address"); err == nil {
        _ = RemoveSocket(address)
    }
}

TaskService impl is in runtime/v2/runc/service.go. Tracing Update:

func (s *service) Update(ctx context.Context, r *taskAPI.UpdateTaskRequest) (*ptypes.Empty, error) {
    container, err := s.getContainer()
    if err != nil { return nil, err }
    if err := container.Update(ctx, r); err != nil {
        return nil, errdefs.ToGRPC(err)
    }
    return empty, nil
}

Following into process/init.go delete: the delete flow holds a process lock, blocking Update from acquiring it. That's why shim doesn't respond. Deeper investigation deferred.

Solution

Two practical fixes:

  1. overlayfs volatile mount: eliminates sync at the source. Requires containerd ≥ 1.6.24 + kernel > 5.10 / backported 4.18. Infrastructure work, multi-batch canary, depends on kernel version.
  2. CRI: skip Task Update for exiting containers: small diff, direct payoff. Verified first.

Fix 2: Stopping flag, skip task.Update

Direct approach: when containerd receives TaskExit, mark the container as Stopping. Subsequent UpdateContainerResources checks the flag and returns early. No more wasted RPC on exiting containers, no more long Status-lock holds.

Diff:

diff --git a/pkg/cri/server/container_update_resources.go b/pkg/cri/server/container_update_resources.go
@@ -81,9 +81,9 @@ func (c *criService) updateContainerResources(ctx context.Context,
        newStatus = status
        id := cntr.ID
-       // Do not update the container when there is a removal in progress.
-       if status.Removing {
-               return newStatus, fmt.Errorf("container %q is in removing state", id)
+       // Do not update the container when there is a removal or stop in progress.
+       if status.Removing || status.Stopping {
+               return newStatus, fmt.Errorf("container %q is in removing or stopping state", id)
        }

diff --git a/pkg/cri/server/events.go b/pkg/cri/server/events.go
@@ -377,6 +377,13 @@ func (em *eventMonitor) handleEvent(any interface{}) error {

 // handleContainerExit handles TaskExit event for container.
 func handleContainerExit(ctx context.Context, e *eventtypes.TaskExit, cntr containerstore.Container, sandboxID string, c *criService) error {
+       // First we need to update the status of container to a Stopping state
+       // to avoid container updating from UpdateContainerResources
+       _ = cntr.Status.Update(func(status containerstore.Status) (containerstore.Status, error) {
+               status.Stopping = true
+               return status, nil
+       })
+
        // Attach container IO so that `Delete` could cleanup the stream properly.

@@ -457,6 +464,9 @@ func handleContainerExit(ctx context.Context, e *eventtypes.TaskExit, cntr conta
                        status.ExitCode = int32(e.ExitStatus)
                }
+               // Reset the stopping state
+               status.Stopping = false

diff --git a/pkg/cri/store/container/status.go b/pkg/cri/store/container/status.go
@@ -90,6 +90,8 @@ type Status struct {
        Starting bool `json:"-"`
+       // Stopping indicates that the container is in stopping state.
+       Stopping bool `json:"-"`
        Removing bool `json:"-"`

Build, canary, re-trigger umount2 latency. No more sub-second Status.Get spikes. PLEG metrics back to normal.

Side effect: kubelet rebuilds pods

In canary, some task pods got rebuilt by kubelet:

10:00:14 Receiving a new pod ...
10:00:16 Creating container 84376ad6ce3
10:00:16 PLEG ContainerStarted -> Running
...
11:45:36 computePodActions ContainersToKill{84376ad6ce3}
11:45:36 Killing unwanted container 84376ad6ce3
11:45:42 SyncLoop (PLEG) ContainerDied 84376ad6ce3
11:45:49 Status for pod ... {Phase:Pending} ContainerStatuses{Waiting}
11:47:02 Creating container 730f6e6451a

The user reported actual task runtime was 10:00:1311:45:25 — kubelet triggered a rebuild after natural completion. Looking at kuberuntime_manager.go:

if containerStatus == nil || containerStatus.State != kubecontainer.ContainerStateRunning {
    if kubecontainer.ShouldContainerBeRestarted(&container, pod, podStatus) {
        changes.ContainersToStart = append(changes.ContainersToStart, idx)
        if containerStatus != nil && containerStatus.State == kubecontainer.ContainerStateUnknown {
            // If container is in unknown state, we don't know whether it
            // is actually running or not, always try killing it before
            // restart to avoid having 2 running instances of the same container.
            changes.ContainersToKill[containerStatus.ID] = containerToKillInfo{
                // ...
            }
        }
    }
    continue
}

Inference: when containerd returns an error for a Stopping container, the container's state in kubelet's view becomes Unknown on some paths, triggering "rather kill twice than skip once" rebuild logic.

Mitigation needs refinement — during Stopping, what Status.Get returns should be more carefully marked to keep kubelet out of the rebuild branch. Discussion ongoing with the community.

Reflections

1. Metrics first. Adding "PLEG cost" was the prerequisite for everything that followed. Diagnostic eyes are non-negotiable.

2. Don't be misled by prior fixes. The earlier disk split worked for that scenario. That doesn't mean it works for all PLEG NotReady. Disk split fixed Status update I/O; this one is Status lock blocked by task.Update on exit. Same symptom, completely different path.

3. Lock-hold time vs. lock-acquisition cost. Go locks are cheap. But holding a lock during a multi-minute remote RPC kills the process's concurrency. "Call an RPC inside a lock" is fine when traffic is low and the remote is healthy — but it's a landmine.

4. containerd's own fragility. Shim's delete flow holds a process lock that blocks task.Update. Shim leaks sockets when umount fails. These are the same root issue (resources not promptly released on exception paths) in different shapes.

5. Fix one problem, get another. The CRI fix restored PLEG, but kubelet's Unknown rebuild side effect surfaced. Any state-machine change at the bottom needs the upper-layer recovery paths walked too — side effects often hurt more than the original symptom.

Related


Stability investigation post-mortem, Oct–Nov 2023. Internal links and node names redacted.

Zoe

Written by

Zoe

AI Infra Engineer · LLM Serving · GPU/RDMA · indie hacker, obsessed with shipping tools

Comments