containerd init container stuck in Running: a rootfs-umount timeout corrupts the state machine
On a staging node, an init-proxy container had clearly exited 15+ minutes ago, but from the CRI's perspective it was still Running. Every business container in that pod was waiting for the init to finish. The whole pod was wedged.
This belongs to the same family of problems as our earlier short-task PLEG NotReady investigation (same slow umount2, same CRI state machine), but the failure mode is completely different.
TL;DR
Symptom
The init-proxy container actually exited at 14:25. The CRI didn't see "Exited" until 14:40 — a 15-minute delay. Business containers waited the whole time. The pod was stuck.
Root cause
On task exit, containerd's CRI plugin in handleContainerExit calls task.Delete first, then updates Status:
TaskExit event
├─► task.Delete(...) ← shim runs umount2(rootfs) here
│ ↑ umount2 times out under I/O pressure
│ └─► RPC returns "context deadline exceeded"
│
└─► return early on error → never reaches:
cntr.Status.UpdateSync(... Running → Exited ...)
In short: if task.Delete fails, Status stays at Running forever. Kubelet sees a "dead-but-not-quite-dead" zombie container.
Fix options
- Mitigation: relieve node I/O pressure by putting rootfs on SSD, or use the overlayfs volatile mount to skip the sync.
- Root fix: change CRI's state-transition order so that Status=Exited is written first, then wait for rootfs unmount. Has a shim-leak risk; needs careful design.
- Stopgap: when task.Delete fails, continue with the Status update anyway. The containerd community has been discussing this.
Walking through detection → reproduction → diagnosis below.
Background
One morning, a staging-cluster node's init-proxy container was clearly hanging long after it should have exited. The user pod was stuck initializing. Ops grabbed crictl output — by eye, the business container was waiting on init-proxy, but init-proxy showed Running in CRI with a startedAt 15+ minutes in the past.
On the node, looking at containerd's status file (/media/disk1/containerd/io.containerd.grpc.v1.cri/containers/<cid>/status):
File: 'status'
Size: 159 Blocks: 8 IO Block: 4096
Access: 2023-12-28 14:40:23.234325965 +0800
Modify: 2023-12-28 14:40:23.234325965 +0800
Change: 2023-12-28 14:40:23.950324617 +0800
{
"status": {
"state": "CONTAINER_EXITED",
"createdAt": "2023-12-28T14:25:00.479913972+08:00",
"startedAt": "2023-12-28T14:25:01.79191972+08:00",
"finishedAt": "2023-12-28T14:25:02.271226053+08:00",
"exitCode": 0,
"reason": "Completed"
}
}
Key comparison:
- finishedAt (task actually exited): 14:25:02
- Status file's modify time (CRI noticed the exit): 14:40:23
A 15-minute gap in between. Kubelet's PLEG couldn't see this container's Exited state, so the pod's next step couldn't move forward.
Analysis
Step 1: read the CRI exit handler
containerd's CRI exit entrypoint is handleContainerExit in pkg/cri/server/events.go. Simplified flow:
func handleContainerExit(ctx context.Context, e *eventtypes.TaskExit, cntr containerstore.Container) error {
// 1) Mark Stopping (avoid lock contention with update_resource — fix from the PLEG NotReady post)
_ = cntr.Status.Update(func(status containerstore.Status) (containerstore.Status, error) {
status.Stopping = true
return status, nil
})
// 2) Get the task
task, err := cntr.Container.Task(ctx, ...)
if err != nil {
if !errdefs.IsNotFound(err) {
return fmt.Errorf("failed to load task for container: %w", err)
}
} else {
// 3) task.Delete — internally does shim Delete + rootfs umount
if _, err = task.Delete(ctx, WithNRISandboxDelete(cntr.SandboxID), containerd.WithProcessKill); err != nil {
if !errdefs.IsNotFound(err) {
return fmt.Errorf("failed to stop container: %w", err)
// ↑ returns here — code below never runs
}
}
}
// 4) Update Status: Running → Exited
err = cntr.Status.UpdateSync(func(status containerstore.Status) (containerstore.Status, error) {
if status.FinishedAt == 0 {
status.Pid = 0
status.FinishedAt = e.ExitedAt.UnixNano()
status.ExitCode = int32(e.ExitStatus)
}
if status.Unknown {
status.Unknown = false
}
return status, nil
})
...
}
The root cause is already visible: if step 3 fails, step 4 never executes. To confirm, we need timing.
Step 2: add timing across task.Delete
I added timing logs across the relevant paths to track which hop is slow:
// task.go: client-side view of task.Delete
st := time.Now()
r, err := t.client.TaskService().Delete(ctx, &tasks.DeleteTaskRequest{
ContainerID: t.id,
})
logrus.Debugf("[zoe] task delete call task service delete coast %+v", time.Now().Sub(st))
// runtime/v2/manager.go: TaskManager view
func (m *TaskManager) Delete(ctx context.Context, taskID string) (*runtime.Exit, error) {
st := time.Now()
item, err := m.manager.shims.Get(ctx, taskID)
log.G(ctx).Infof("[zoe] taskmanager delete get shim coast %v", time.Now().Sub(st))
...
st = time.Now()
exit, err := shimTask.delete(ctx, func(ctx context.Context, id string) {
m.manager.shims.Delete(ctx, id)
})
log.G(ctx).Infof("[zoe] taskmanager delete shim task coast %v", time.Now().Sub(st))
...
}
// runtime/v2/runc/container.go: shim-side Container.Delete
func (c *Container) Delete(ctx context.Context, r *task.DeleteRequest) (process.Process, error) {
st := time.Now()
p, err := c.Process(r.ExecID)
logrus.Infof("[zoe] runc delete container get process %s", time.Now().Sub(st))
...
st = time.Now()
if err := p.Delete(ctx); err != nil {
logrus.Infof("[zoe] runc delete container delete process %s", time.Now().Sub(st))
return nil, err
}
logrus.Infof("[zoe] runc delete container delete process %s", time.Now().Sub(st))
...
}
// pkg/process/init.go: Init.delete — the real work, calls runc + umount
func (p *Init) delete(ctx context.Context) error {
waitTimeout(ctx, &p.wg, 2*time.Second)
st := time.Now()
err := p.runtime.Delete(ctx, p.id, nil)
log.G(ctx).Info("[zoe] Init delete/runtime.Delete coast", time.Now().Sub(st))
...
st = time.Now()
defer func() {
log.G(ctx).Info("[zoe] Init delete/UnmountAll coast", time.Now().Sub(st))
}()
if err2 := mount.UnmountAll(p.Rootfs, 0); err2 != nil {
log.G(ctx).WithError(err2).Warn("failed to cleanup rootfs mount")
...
}
return err
}
The goal: log the duration of every hop along "client → manager → shim → init.delete → UnmountAll", so the logs show exactly which step blocks.
Step 3: inject latency to reproduce
The natural fault is intermittent. I used strace's inject feature to add latency to umount2 directly — extremely effective for I/O-slow debugging:
# Get the shim PID
SHIM_PID=$(ps -ef | grep $(sudo crictl pods | grep node-problem | grep -w Ready | awk '{print $1}') | grep -v grep | awk '{print $2}')
# Inject 60s latency on umount2 for this shim (strace >= 4.22)
sudo strace -e trace=umount2 -f -e inject=umount2:delay_enter=60000000 -b execve -p $SHIM_PID
Then in another terminal, journalctl -u containerd -f to catch the [zoe] logs. Finally kill the container's business process to trigger exit:
sudo ps -ef | grep node-problem-detector | grep -v grep | awk '{print $2}' | xargs sudo kill
Step 4: reconcile timestamps
Captured logs (abridged):
level=debug msg="received exit event &TaskExit{ContainerID:1d586e..., ExitedAt:2023-12-28 13:18:49.167031554, ...}"
level=info msg="[zoe] taskmanager delete shim task coast 9.991741935s"
level=debug msg="[zoe] task delete call task service delete coast 9.991791137s"
level=debug msg="failed to delete task" error="context deadline exceeded"
level=debug msg="[zoe] OnTeskExits delete task failed coast 10.000087609s"
level=info msg="[zoe] runc delete container delete process 20.029129999s"
Reading this reconstructs the full story:
- 13:18:49: TaskExit received, enters handleContainerExit.
- CRI calls task.Delete → ttrpc to shim.
- Shim in Container.Delete → Init.delete → UnmountAll(rootfs) blocks 20 seconds (the injected latency).
- CRI-side ctx defaults to a 10s timeout, so the client gets context deadline exceeded at 10s.
- task.Delete returns an error. handleContainerExit does return fmt.Errorf(...). Status.UpdateSync never runs. The status file is still Running.
- It stays that way until some compensating mechanism inside Kubelet or containerd retries; in this case, 15 minutes later.
That's how a task that exits at 14:25 only shows up as Exited in CRI at 14:40.
Step 5: why umount2 is slow
Going back to the 20-second shim block: the rootfs is overlayfs, with the upper layer on the main data disk. Overlayfs runs a full filesystem sync on umount. Under I/O pressure or with many dirty pages, that sync can take minutes.
This is the same overlayfs sync issue as the short-task PLEG NotReady investigation — but the failure mode is different:
- That one: slow umount2 → shim unresponsive → CRI Status lock unattainable → ListContainers blocks → PLEG NotReady.
- This one: slow umount2 → task.Delete times out → CRI state machine short-circuits → container stuck in Running.
Same underlying slow I/O, two completely different user-visible failure modes.
Fix paths
The core conclusion is simple:
CRI must not skip the Status update when task.Delete fails.
But fixing this properly takes work at more than one layer.
Mitigation: reduce umount2 I/O pressure
Direct routes:
- Move to SSD to raise the I/O ceiling — ops layer.
- overlayfs volatile mount: containerd ≥ 1.6.24 supports it; with volatile, umount doesn't sync. Needs kernel ≥ 5.10 (or backported 4.18).
- Cap container density and short-task concurrency to prevent sustained I/O saturation.
Root fix: reorder CRI state transitions
The aspirational fix: set Status to Exited first, then wait for rootfs unmount. That way Kubelet sees the correct status regardless of umount success.
Two new problems this introduces:
- Shim leak risk: if Status flips to Exited but umount fails, who cleans up the shim process? Needs a separate cleanup pipeline.
- Diverges from upstream design: upstream is "Delete then Update." Changing the order needs a community issue and ongoing maintenance cost.
Stopgap: write Status even on failure
Lightest backstop: when task.Delete fails but we do have the TaskExit event, at least write FinishedAt / ExitCode into Status. Let shim cleanup happen later. Small change, high real-world value.
Appendix
Full diff (debug version, for reference only):
diff --git a/pkg/cri/server/events.go b/pkg/cri/server/events.go
@@ -384,11 +384,14 @@ func handleContainerExit(ctx context.Context, e *eventtypes.TaskExit, cntr conta
}
} else {
// TODO(random-liu): [P1] This may block the loop, we may want to spawn a worker
+ st := time.Now()
if _, err = task.Delete(ctx, WithNRISandboxDelete(cntr.SandboxID), containerd.WithProcessKill); err != nil {
if !errdefs.IsNotFound(err) {
+ logrus.Debugf("[zoe] OnTeskExits delete task failed coast %+v", time.Now().Sub(st))
return fmt.Errorf("failed to stop container: %w", err)
}
+ logrus.Debugf("[zoe] OnTeskExits delete task success coast %+v", time.Now().Sub(st))
}
}
diff --git a/pkg/process/init.go b/pkg/process/init.go
@@ -290,7 +292,9 @@ func (p *Init) Delete(ctx context.Context) error {
func (p *Init) delete(ctx context.Context) error {
waitTimeout(ctx, &p.wg, 2*time.Second)
+ st := time.Now()
err := p.runtime.Delete(ctx, p.id, nil)
+ log.G(ctx).Info("[zoe] Init delete/runtime.Delete coast", time.Now().Sub(st))
...
+ st = time.Now()
+ defer func() {
+ log.G(ctx).Info("[zoe] Init delete/UnmountAll coast", time.Now().Sub(st))
+ }()
if err2 := mount.UnmountAll(p.Rootfs, 0); err2 != nil {
strace recipe to reproduce umount2 latency:
sudo strace -e trace=umount2 \
-f -e inject=umount2:delay_enter=60000000 \
-b execve -p $SHIM_PID
Related: short-task PLEG NotReady investigation — same family of umount / I/O issues, different failure mode.

Written by
Zoe
AI Infra Engineer · LLM Serving · GPU/RDMA · indie hacker, obsessed with shipping tools