containerd init container stuck in Running: a rootfs-umount timeout corrupts the state machine

On a staging node, an init-proxy container had clearly exited 15+ minutes ago, but from the CRI's perspective it was still Running. Every business container in that pod was waiting for the init to finish. The whole pod was wedged.

This belongs to the same family of problems as our earlier short-task PLEG NotReady investigation (same slow umount2, same CRI state machine), but the failure mode is completely different.

TL;DR

Symptom

The init-proxy container actually exited at 14:25. The CRI didn't see "Exited" until 14:40 — a 15-minute delay. Business containers waited the whole time. The pod was stuck.

Root cause

On task exit, containerd's CRI plugin in handleContainerExit calls task.Delete first, then updates Status:

TaskExit event
  ├─► task.Delete(...)           ← shim runs umount2(rootfs) here
  │     ↑ umount2 times out under I/O pressure
  │     └─► RPC returns "context deadline exceeded"
  │
  └─► return early on error → never reaches:
        cntr.Status.UpdateSync(... Running → Exited ...)

In short: if task.Delete fails, Status stays at Running forever. Kubelet sees a "dead-but-not-quite-dead" zombie container.

Fix options

Mitigation: relieve node I/O pressure — SSD for rootfs, or overlayfs volatile mount to skip the sync.
Root fix: change CRI's state-transition order — write Status=Exited first, then wait for rootfs unmount. Has a shim-leak risk; needs careful design.
Stopgap: when task.Delete fails, continue with the Status update anyway. The containerd community has been discussing this.

The rest of this post walks through detection → reproduction → diagnosis.

Background

One morning, a staging-cluster node's init-proxy container was clearly hanging long after it should have exited. The user pod was stuck initializing. Ops grabbed the crictl output: the business container was visibly waiting on init-proxy, but init-proxy still showed Running in CRI, with a startedAt 15+ minutes in the past.

On the node, looking at containerd's status file (/media/disk1/containerd/io.containerd.grpc.v1.cri/containers/<cid>/status):

File: 'status'
Size: 159     Blocks: 8    IO Block: 4096
Access: 2023-12-28 14:40:23.234325965 +0800
Modify: 2023-12-28 14:40:23.234325965 +0800
Change: 2023-12-28 14:40:23.950324617 +0800

{
  "status": {
    "state": "CONTAINER_EXITED",
    "createdAt": "2023-12-28T14:25:00.479913972+08:00",
    "startedAt": "2023-12-28T14:25:01.79191972+08:00",
    "finishedAt": "2023-12-28T14:25:02.271226053+08:00",
    "exitCode": 0,
    "reason": "Completed"
  }
}

Key comparison:

finishedAt (task actually exited): 14:25:02
Status file's modify time (CRI noticed the exit): 14:40:23

A 15-minute gap in between. Kubelet's PLEG couldn't see this container's Exited state, so the pod's next step couldn't move forward.

Analysis

Step 1: read the CRI exit handler

containerd's CRI exit entrypoint is handleContainerExit in pkg/cri/server/events.go. Simplified flow:

func handleContainerExit(ctx context.Context, e *eventtypes.TaskExit, cntr containerstore.Container) error {
    // 1) Mark Stopping (avoid lock contention with update_resource — fix from the PLEG NotReady post)
    _ = cntr.Status.Update(func(status containerstore.Status) (containerstore.Status, error) {
        status.Stopping = true
        return status, nil
    })

    // 2) Get the task
    task, err := cntr.Container.Task(ctx, ...)
    if err != nil {
        if !errdefs.IsNotFound(err) {
            return fmt.Errorf("failed to load task for container: %w", err)
        }
    } else {
        // 3) task.Delete — internally does shim Delete + rootfs umount
        if _, err = task.Delete(ctx, WithNRISandboxDelete(cntr.SandboxID), containerd.WithProcessKill); err != nil {
            if !errdefs.IsNotFound(err) {
                return fmt.Errorf("failed to stop container: %w", err)
                // ↑ returns here — code below never runs
            }
        }
    }

    // 4) Update Status: Running → Exited
    err = cntr.Status.UpdateSync(func(status containerstore.Status) (containerstore.Status, error) {
        if status.FinishedAt == 0 {
            status.Pid = 0
            status.FinishedAt = e.ExitedAt.UnixNano()
            status.ExitCode = int32(e.ExitStatus)
        }
        if status.Unknown {
            status.Unknown = false
        }
        return status, nil
    })
    ...
}

The root cause is already visible: if step 3 fails, step 4 never executes. To confirm, we need timing.

Step 2: add timing across `task.Delete`

I added timing logs across the relevant paths to track which hop is slow:

// task.go: client-side view of task.Delete
st := time.Now()
r, err := t.client.TaskService().Delete(ctx, &tasks.DeleteTaskRequest{
    ContainerID: t.id,
})
logrus.Debugf("[zoe] task delete call task service delete coast %+v", time.Now().Sub(st))

// runtime/v2/manager.go: TaskManager view
func (m *TaskManager) Delete(ctx context.Context, taskID string) (*runtime.Exit, error) {
    st := time.Now()
    item, err := m.manager.shims.Get(ctx, taskID)
    log.G(ctx).Infof("[zoe] taskmanager delete get shim coast %v", time.Now().Sub(st))
    ...
    st = time.Now()
    exit, err := shimTask.delete(ctx, func(ctx context.Context, id string) {
        m.manager.shims.Delete(ctx, id)
    })
    log.G(ctx).Infof("[zoe] taskmanager delete shim task coast %v", time.Now().Sub(st))
    ...
}

// runtime/v2/runc/container.go: shim-side Container.Delete
func (c *Container) Delete(ctx context.Context, r *task.DeleteRequest) (process.Process, error) {
    st := time.Now()
    p, err := c.Process(r.ExecID)
    logrus.Infof("[zoe] runc delete container get process %s", time.Now().Sub(st))
    ...
    st = time.Now()
    if err := p.Delete(ctx); err != nil {
        logrus.Infof("[zoe] runc delete container delete process %s", time.Now().Sub(st))
        return nil, err
    }
    logrus.Infof("[zoe] runc delete container delete process %s", time.Now().Sub(st))
    ...
}

// pkg/process/init.go: Init.delete — the real work, calls runc + umount
func (p *Init) delete(ctx context.Context) error {
    waitTimeout(ctx, &p.wg, 2*time.Second)
    st := time.Now()
    err := p.runtime.Delete(ctx, p.id, nil)
    log.G(ctx).Info("[zoe] Init delete/runtime.Delete coast", time.Now().Sub(st))
    ...
    st = time.Now()
    defer func() {
        log.G(ctx).Info("[zoe] Init delete/UnmountAll coast", time.Now().Sub(st))
    }()
    if err2 := mount.UnmountAll(p.Rootfs, 0); err2 != nil {
        log.G(ctx).WithError(err2).Warn("failed to cleanup rootfs mount")
        ...
    }
    return err
}

The goal is to log the duration of every hop along "client → manager → shim → init.delete → UnmountAll", so the logs show exactly which step blocks.

Step 3: inject latency to reproduce

The natural fault is intermittent. I used strace's inject feature to add latency to umount2 directly — extremely effective for I/O-slow debugging:

# Get the shim PID
SHIM_PID=$(ps -ef | grep $(sudo crictl pods | grep node-problem | grep -w Ready | awk '{print $1}') | grep -v grep | awk '{print $2}')

# Inject 60s latency on umount2 for this shim (strace >= 4.22)
sudo strace -e trace=umount2 -f -e inject=umount2:delay_enter=60000000 -b execve -p $SHIM_PID

Then in another terminal, journalctl -u containerd -f to catch the [zoe] logs. Finally kill the container's business process to trigger exit:

sudo ps -ef | grep node-problem-detector | grep -v grep | awk '{print $2}' | xargs sudo kill

Step 4: reconcile timestamps

Captured logs (abridged):

level=debug msg="received exit event &TaskExit{ContainerID:1d586e..., ExitedAt:2023-12-28 13:18:49.167031554, ...}"
level=info  msg="[zoe] taskmanager delete shim task coast 9.991741935s"
level=debug msg="[zoe] task delete call task service delete coast 9.991791137s"
level=debug msg="failed to delete task" error="context deadline exceeded"
level=debug msg="[zoe] OnTeskExits delete task failed coast 10.000087609s"

level=info msg="[zoe] runc delete container delete process 20.029129999s"

Reading this reconstructs the full story:

13:18:49: TaskExit received, enters handleContainerExit.
CRI calls task.Delete → ttrpc to shim.
Shim in Container.Delete → Init.delete → UnmountAll(rootfs) blocks 20 seconds (the injected latency).
CRI-side ctx defaults to 10s timeout — so the client gets context deadline exceeded at 10s.
task.Delete returns error. handleContainerExit does return fmt.Errorf(...).
Status.UpdateSync never runs. The status file is still Running.
The status stays that way until some compensating mechanism inside Kubelet or containerd retries the exit handling; in the original incident, that happened 15 minutes later.

That's how task exits at 14:25 but CRI sees Exited at 14:40.

Step 5: why umount2 is slow

Going back to the 20-second shim block: the rootfs is overlayfs, upper layer on the main data disk. Overlayfs runs a full filesystem sync on umount. Under I/O pressure or with many dirty pages, that sync takes minutes.

This is the same overlayfs sync issue as the short-task PLEG NotReady investigation — but the failure mode is different:

That one: slow umount2 → shim unresponsive → CRI Status lock unattainable → ListContainers blocks → PLEG NotReady.
This one: slow umount2 → task.Delete times out → CRI state machine short-circuits → container stuck in Running.

Same underlying slow I/O, two completely different user-visible failure modes.
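One rough way to see whether a node is in the "many dirty pages" regime is to watch the Dirty and Writeback counters in /proc/meminfo. The sketch below parses them; `writebackKB` is a name invented here, and the counters are node-wide rather than per-filesystem, so treat them as a coarse signal only.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strconv"
	"strings"
)

// writebackKB extracts the Dirty and Writeback counters (in kB) from text
// in /proc/meminfo format. Keys are matched exactly, so WritebackTmp is
// not mistaken for Writeback.
func writebackKB(meminfo string) (dirty, writeback int64, err error) {
	found := 0
	sc := bufio.NewScanner(strings.NewReader(meminfo))
	for sc.Scan() {
		f := strings.Fields(sc.Text())
		if len(f) < 2 {
			continue
		}
		var dst *int64
		switch f[0] {
		case "Dirty:":
			dst = &dirty
		case "Writeback:":
			dst = &writeback
		default:
			continue
		}
		if *dst, err = strconv.ParseInt(f[1], 10, 64); err != nil {
			return 0, 0, err
		}
		found++
	}
	if found != 2 {
		return 0, 0, fmt.Errorf("meminfo: Dirty/Writeback fields not found")
	}
	return dirty, writeback, nil
}

func main() {
	data, err := os.ReadFile("/proc/meminfo")
	if err != nil {
		fmt.Println("cannot read /proc/meminfo:", err)
		return
	}
	dirty, wb, err := writebackKB(string(data))
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Printf("Dirty=%d kB Writeback=%d kB\n", dirty, wb)
}
```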

Fix paths

The core conclusion is simple:

CRI must not skip the Status update when task.Delete fails.

But fixing it properly takes work at more than one layer.

Mitigation: reduce umount2 I/O pressure

Direct routes:

Move to SSD to raise the I/O ceiling — ops layer.
overlayfs volatile mount: containerd ≥ 1.6.24 supports it; with volatile, umount doesn't sync. Needs kernel ≥ 5.10 (or backported 4.18).
Cap container density and short-task concurrency to prevent sustained I/O saturation.
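After enabling volatile mounts, it is worth confirming that container rootfs mounts actually carry the option. The sketch below checks one /proc/mounts line; it assumes the kernel lists `volatile` among the overlay mount options when it is set, and `isVolatileOverlay` plus the sample paths are made up for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// isVolatileOverlay reports whether one /proc/mounts line describes an
// overlay mount whose option list contains "volatile".
func isVolatileOverlay(line string) bool {
	f := strings.Fields(line)
	// Expected layout: device mountpoint fstype options dump pass
	if len(f) < 4 || f[2] != "overlay" {
		return false
	}
	for _, opt := range strings.Split(f[3], ",") {
		if opt == "volatile" {
			return true
		}
	}
	return false
}

func main() {
	// Hypothetical sample lines; on a real node, read /proc/mounts instead.
	lines := []string{
		"overlay /run/example/rootfs overlay rw,relatime,lowerdir=/l,upperdir=/u,workdir=/w,volatile 0 0",
		"overlay /run/example/other overlay rw,relatime,lowerdir=/l,upperdir=/u,workdir=/w 0 0",
	}
	for _, l := range lines {
		fmt.Println(isVolatileOverlay(l), strings.Fields(l)[1])
	}
}
```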

Root fix: reorder CRI state transitions

The aspirational fix: set Status to Exited first, then wait for rootfs unmount. That way Kubelet sees the correct status regardless of umount success.

Two new problems this introduces:

Shim leak risk: if Status flips to Exited but umount fails, who cleans up the shim process? Needs a separate cleanup pipeline.
Diverges from upstream design: upstream is "Delete then Update." Changing the order means raising a community issue and carrying an ongoing maintenance cost.

Stopgap: write Status even on failure

Lightest backstop: when task.Delete fails but we do have the TaskExit event, at least write FinishedAt / ExitCode into Status. Let shim cleanup happen later. Small change, high real-world value.

Appendix

Full diff (debug version, for reference only):

diff --git a/pkg/cri/server/events.go b/pkg/cri/server/events.go
@@ -384,11 +384,14 @@ func handleContainerExit(ctx context.Context, e *eventtypes.TaskExit, cntr conta
        }
    } else {
        // TODO(random-liu): [P1] This may block the loop, we may want to spawn a worker
+       st := time.Now()
        if _, err = task.Delete(ctx, WithNRISandboxDelete(cntr.SandboxID), containerd.WithProcessKill); err != nil {
            if !errdefs.IsNotFound(err) {
+               logrus.Debugf("[zoe] OnTeskExits delete task failed coast %+v", time.Now().Sub(st))
                return fmt.Errorf("failed to stop container: %w", err)
            }
+           logrus.Debugf("[zoe] OnTeskExits delete task success coast %+v", time.Now().Sub(st))
        }
    }

diff --git a/pkg/process/init.go b/pkg/process/init.go
@@ -290,7 +292,9 @@ func (p *Init) Delete(ctx context.Context) error {
 func (p *Init) delete(ctx context.Context) error {
    waitTimeout(ctx, &p.wg, 2*time.Second)
+   st := time.Now()
    err := p.runtime.Delete(ctx, p.id, nil)
+   log.G(ctx).Info("[zoe] Init delete/runtime.Delete coast", time.Now().Sub(st))
    ...
+   st = time.Now()
+   defer func() {
+       log.G(ctx).Info("[zoe] Init delete/UnmountAll coast", time.Now().Sub(st))
+   }()
    if err2 := mount.UnmountAll(p.Rootfs, 0); err2 != nil {

strace recipe to reproduce umount2 latency:

sudo strace -e trace=umount2 \
    -f -e inject=umount2:delay_enter=60000000 \
    -b execve -p $SHIM_PID

Related: short-task PLEG NotReady investigation — same family of umount / I/O issues, different failure mode.
