Running LLM inference on K8s: auto-injecting RDMA permissions and GPU-NIC topology affinity


If you're running distributed LLM inference or training on Kubernetes, RDMA networking is one of those problems everybody underestimates. This post introduces k8s-rdma-device-plugin, an open-source project that solves two of the biggest pain points: auto-injecting RDMA device permissions into containers, and matching the right RDMA NIC to the right GPUs via PCIe topology.

Why RDMA matters more than ever in AI inference

As LLM serving moves toward multi-GPU TP (Tensor Parallelism), multi-node PP (Pipeline Parallelism), and P/D disaggregation (Prefill–Decode split), GPU-to-GPU bandwidth becomes the bottleneck.

  • vLLM is heavily pushing Large Scale Serving in its 2026 Q1 roadmap — GB200, Wide EP, P/D disagg.
  • SGLang's HiCache + Mooncake moves KV cache over RDMA to enable P/D disagg.
  • DeepSpeed / Megatron-LM depend on GPUDirect RDMA for gradient sync during training.

NCCL automatically uses RDMA when it detects an InfiniBand/RoCE NIC. Throughput jumps from ~10 Gbps over TCP to 200–400 Gbps over IB. But only if the container can actually access /dev/infiniband/*.

The three pain points of managing RDMA on K8s

1. The container has no device permissions

By default, K8s pods can't touch the host's /dev/infiniband/ device nodes. NCCL silently falls back to TCP. Throughput tanks but nothing errors out — which is the worst kind of failure, because nobody knows they're running on TCP.
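One inexpensive guard against the silent fallback is a preflight check that runs inside the container before the job starts. The sketch below is only an illustration and is not part of k8s-rdma-device-plugin; the helper name `find_uverbs_nodes` is hypothetical. It lists the verbs device nodes the process can see, so an empty result is a strong hint that NCCL will end up on TCP.

```python
import glob
import os


def find_uverbs_nodes(dev_dir="/dev/infiniband"):
    """Return the sorted uverbs device paths visible under dev_dir.

    An empty list means no verbs devices are exposed to this process,
    so RDMA transports cannot be used from here.
    """
    return sorted(glob.glob(os.path.join(dev_dir, "uverbs*")))
```

A launcher script could call `find_uverbs_nodes()` and refuse to start when the list is empty. Running the job with `NCCL_DEBUG=INFO` and reading which network NCCL reports is a complementary check.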

The common "fixes" are ugly:

# ❌ Option 1: privileged mode (too dangerous)
securityContext:
  privileged: true

# ❌ Option 2: manual hostPath (fragile, non-portable)
volumes:
  - name: infiniband
    hostPath:
      path: /dev/infiniband

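Privileged mode hands the container full host access. A hostPath mount hard-codes one node's device layout, so it breaks the moment node hardware differs. It is also usually paired with privileged mode anyway, because a bind mount alone does not add the device nodes to the container's device cgroup allowlist.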

2. Picking the wrong RDMA NIC

A typical 8-GPU box has multiple RDMA NICs (e.g. 4× ConnectX-7), each physically wired to specific GPUs via a PCIe switch. If GPU 0's traffic ends up on the NIC closest to GPU 7, the packets cross NUMA boundaries — sometimes even the PCIe root complex — and latency spikes.

The manual fix is NCCL_IB_HCA=mlx5_0, but that requires knowing each node's PCIe topology by hand. It does not scale.

3. No resource accounting

The K8s scheduler has no idea whether a node has RDMA NICs, how many, or how much capacity. In multi-tenant clusters, that means no real isolation or admission control.

Existing options and why they fall short

Solution Resource accounting Device permission injection GPU-NIC affinity No privileged
Mellanox k8s-rdma-shared-dev-plugin
SR-IOV network device plugin ✅ (VFs) partial
Manual hostPath + privileged ✅ (sledgehammer)

Mellanox's shared-plugin is the closest existing solution but only counts resources — it doesn't inject permissions. You get rdma/hca_shared_devices: 1 but no /dev/infiniband/uverbs0 inside the container.

The SR-IOV plugin targets network virtualization (VF allocation), needs Multus CNI, and is overkill for pure RDMA workloads.

The solution: k8s-rdma-device-plugin

I built one project that combines all three capabilities:

┌──────────────────────────────────────────────────────┐
│         k8s-rdma-device-plugin (DaemonSet)           │
│                                                      │
│  ┌─────────────────┐  ┌────────────────────────────┐ │
│  │  Device Plugin   │  │  NRI Plugin                │ │
│  │                  │  │                            │ │
│  │  Reports virtual │  │  • Auto-injects RDMA       │ │
│  │  RDMA resources  │  │    device permissions      │ │
│  │  to kubelet      │  │  • Annotation-level        │ │
│  │  (toggleable)    │  │    fine-grained control    │ │
│  │                  │  │  • Auto GPU-NIC affinity   │ │
│  └─────────────────┘  └────────────────────────────┘ │
└──────────────────────────────────────────────────────┘

Capability 1: device plugin resource reporting

Registers a virtual RDMA resource (default rdma.io/hca, count configurable) with kubelet via the standard Device Plugin Framework. Lets the scheduler reason about RDMA.

A single RDMA NIC has no hardware-level isolation, so the "virtual resource" is fungible — each allocation just says "this pod needs RDMA access," not "this pod owns NIC #3."

Set --enable-device-plugin=false if you only want the NRI injection.

Capability 2: NRI device permission injection

Uses the containerd NRI (Node Resource Interface) hook to inject RDMA device nodes at container creation:

  • Global: /dev/infiniband/rdma_cm
  • Per-NIC: /dev/infiniband/uverbs*, /dev/infiniband/umad*, /dev/infiniband/issm*

NRI is a plugin mechanism built into the container runtime (containerd, and also CRI-O). No CNI changes, no Multus, no privileged containers.

It also supports fine-grained control through annotations:

metadata:
  annotations:
    # uverbs nodes use major 231 with minors starting at 192
    # (uverbs0 = 231:192); confirm with `ls -l /dev/infiniband` on the node.
    # Inject for all containers in the pod
    devices.nri.io/pod: |
      - path: /dev/infiniband/uverbs0
        type: c
        major: 231
        minor: 192
    # Inject for a specific container
    devices.nri.io/container.myapp: |
      - path: /dev/infiniband/uverbs1
        type: c
        major: 231
        minor: 193
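The major and minor numbers in such an annotation should come from the node itself rather than from memory. As a rough sketch (the function names `describe_device` and `describe_rdma_devices` are hypothetical, and this is not the plugin's own code), the values can be read with a plain `stat` call:

```python
import os
import stat


def describe_device(path):
    """Return an annotation-style entry for a character device, else None."""
    st = os.stat(path)
    if not stat.S_ISCHR(st.st_mode):
        return None
    return {
        "path": path,
        "type": "c",
        "major": os.major(st.st_rdev),
        "minor": os.minor(st.st_rdev),
    }


def describe_rdma_devices(dev_dir="/dev/infiniband"):
    """Describe every character device directly under dev_dir."""
    if not os.path.isdir(dev_dir):
        return []
    entries = []
    for name in sorted(os.listdir(dev_dir)):
        entry = describe_device(os.path.join(dev_dir, name))
        if entry is not None:
            entries.append(entry)
    return entries
```

Run on the host, `describe_rdma_devices()` yields one entry per node (rdma_cm, uverbs*, umad*, issm*), with fields named like the ones in the annotation above.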

Capability 3: GPU-NIC PCIe topology affinity

This is the headline feature. With gpuRdmaAutoInject enabled:

  1. Detect the container's NVIDIA_VISIBLE_DEVICES to learn which GPUs it owns
  2. Enumerate GPU PCI BDFs from /sys/bus/pci/drivers/nvidia/
  3. Enumerate RDMA device PCI BDFs from /sys/class/infiniband/
  4. Match GPUs to NICs by PCIe topology:
    • Same PCIe root complex (best — same switch, lowest latency)
    • Same NUMA node (fallback — same memory domain)
  5. Inject the matched NIC into the container
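The matching rule in step 4 can be sketched in a few lines. This is an illustration under assumptions, not the project's implementation: the `PciDevice` type and both function names are hypothetical, the PCIe root is taken to be the `pciDDDD:BB` component of a device's resolved sysfs path, and the sketch matches against the whole set of GPUs a container owns rather than GPU by GPU.

```python
import re
from dataclasses import dataclass

# Host-bridge directory name in a resolved sysfs path, e.g. "pci0000:17".
_ROOT_RE = re.compile(r"^pci[0-9a-f]{4}:[0-9a-f]{2}$")


@dataclass(frozen=True)
class PciDevice:
    name: str        # e.g. "gpu0" or "mlx5_0"
    sysfs_path: str  # resolved path under /sys/devices
    numa_node: int   # -1 when the kernel reports no NUMA affinity


def pcie_root(dev):
    """Return the host-bridge component of the device path, or None."""
    for part in dev.sysfs_path.split("/"):
        if _ROOT_RE.match(part):
            return part
    return None


def match_nics(gpus, nics):
    """Pick NIC names for the GPUs: same PCIe root first, then same NUMA node."""
    gpu_roots = {pcie_root(g) for g in gpus} - {None}
    same_root = [n.name for n in nics if pcie_root(n) in gpu_roots]
    if same_root:
        return same_root
    # Fallback: NICs on a NUMA node shared with at least one GPU.
    gpu_numa = {g.numa_node for g in gpus if g.numa_node >= 0}
    return [n.name for n in nics if n.numa_node in gpu_numa]
```

An empty result means neither rule matched; what to do then (inject nothing, or inject every NIC) is a policy decision the sketch leaves open.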

Visualize it on a typical 8×H100 + 4×CX-7 machine:

PCIe Root 0          PCIe Root 1
├── GPU 0            ├── GPU 4
├── GPU 1            ├── GPU 5
├── mlx5_0           ├── mlx5_2
├── GPU 2            ├── GPU 6
├── GPU 3            ├── GPU 7
└── mlx5_1           └── mlx5_3

Pod asks for GPU 0,1 → mlx5_0
Pod asks for GPU 4,5 → mlx5_2

Key point: this works for any GPU container with NVIDIA_VISIBLE_DEVICES set. The pod doesn't need to explicitly request rdma.io/hca. Pure GPU inference pods get the right NIC automatically.
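Reading NVIDIA_VISIBLE_DEVICES takes a little care, because the variable is not always a list of indices: the NVIDIA container runtime also accepts `all`, `none`, `void`, an empty value, and GPU UUIDs, and clusters that use the NVIDIA device plugin can see UUIDs there. The parser below is a minimal sketch with a hypothetical function name; how the plugin itself treats each case is not covered here.

```python
def parse_visible_devices(value):
    """Parse an NVIDIA_VISIBLE_DEVICES value.

    Returns None for "all", [] when no GPUs are exposed, otherwise a list
    whose items are ints (GPU indices) or strings (for example GPU UUIDs).
    """
    value = (value or "").strip()
    if value in ("", "none", "void"):
        return []
    if value == "all":
        return None
    devices = []
    for token in value.split(","):
        token = token.strip()
        if not token:
            continue
        # Plain decimal tokens are indices; anything else is kept verbatim.
        devices.append(int(token) if token.isdecimal() else token)
    return devices
```

Index entries can be mapped to PCI addresses by position, while UUID entries need an extra lookup before any topology matching can happen.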

Real-world usage

vLLM multi-GPU inference

apiVersion: v1
kind: Pod
metadata:
  name: vllm-tp8  # example name; metadata.name is required
spec:
  containers:
    - name: vllm
      image: vllm/vllm-openai:latest
      args: ["--model", "deepseek-ai/DeepSeek-V3", "--tensor-parallel-size", "8"]
      env:
        - name: NVIDIA_VISIBLE_DEVICES
          value: "0,1,2,3,4,5,6,7"
      resources:
        limits:
          nvidia.com/gpu: "8"
  # RDMA devices auto-injected based on PCIe topology!

SGLang P/D disaggregation

No need to manually pass --disaggregation-ib-device mlx5_1 — the right RDMA NIC is already in the container.

apiVersion: v1
kind: Pod
metadata:
  name: sglang-prefill  # example name; metadata.name is required
spec:
  containers:
    - name: sglang-prefill
      image: lmsysorg/sglang:latest
      command: ["python", "-m", "sglang.launch_server"]
      args:
        - "--model-path"
        - "meta-llama/Meta-Llama-3-70B"
        - "--tp"
        - "4"
        - "--disaggregation-mode"
        - "prefill"
      env:
        - name: NVIDIA_VISIBLE_DEVICES
          value: "0,1,2,3"
      resources:
        limits:
          nvidia.com/gpu: "4"

Inject-only mode (no kubelet accounting)

If you just want auto-injection without resource accounting:

helm install rdma-device-plugin ./deploy/charts \
  --namespace kube-system \
  --set rdma.enableDevicePlugin=false \
  --set gpuRdmaAutoInject=true

Deployment

Prereqs

  • Kubernetes 1.26+
  • containerd 1.7+ with NRI enabled (disable = false under [plugins."io.containerd.nri.v1.nri"]; on by default since containerd 2.0)
  • Mellanox/NVIDIA ConnectX-series RDMA NICs
  • NVIDIA GPU drivers (for affinity mode)

Helm install

helm install rdma-device-plugin ./deploy/charts \
  --namespace kube-system \
  --set rdma.resourceName="rdma.io/hca" \
  --set rdma.resourceCount=100 \
  --set gpuRdmaAutoInject=true

Configuration

Precedence: CLI flags > env vars > config file > defaults

Env var                     Description              Default
RDMA_ENABLE_DEVICE_PLUGIN   Enable device plugin     true
RDMA_RESOURCE_NAME          Resource name            rdma.io/hca
RDMA_RESOURCE_COUNT         Virtual resource count   100
RDMA_GPU_AUTO_INJECT        GPU-NIC auto-inject      false
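The precedence rule is easy to get subtly wrong, so here is a small sketch of it. It is illustrative only: `resolve_setting` is a hypothetical helper, and keying the CLI and config-file layers by the environment variable name is a simplification made for the example, not how the project stores them.

```python
import os

# Defaults as listed in the table above.
DEFAULTS = {
    "RDMA_ENABLE_DEVICE_PLUGIN": "true",
    "RDMA_RESOURCE_NAME": "rdma.io/hca",
    "RDMA_RESOURCE_COUNT": "100",
    "RDMA_GPU_AUTO_INJECT": "false",
}


def resolve_setting(key, cli=None, env=None, config_file=None):
    """Resolve one setting: CLI flags > env vars > config file > defaults."""
    env = os.environ if env is None else env
    # Walk the layers from highest to lowest priority; first hit wins.
    for layer in (cli or {}, env, config_file or {}):
        if key in layer:
            return layer[key]
    return DEFAULTS[key]
```

With this ordering, a value in the config file only takes effect when neither a flag nor an environment variable sets the same key.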

Implementation notes

Why NRI over an Admission Webhook

Admission webhooks can only mutate the Pod spec (add volumes / mounts). They can't directly operate on the container's Linux device cgroup. NRI runs inside containerd's container lifecycle hooks, where we can inject a LinuxDevice with precise major/minor and permissions.

PCIe topology discovery via sysfs

We read topology purely from sysfs — no user-space tools required:

/sys/bus/pci/drivers/nvidia/       → GPU PCI BDF list
/sys/class/infiniband/<dev>/device → RDMA device PCI BDF (symlink)
/sys/bus/pci/devices/<BDF>/        → numa_node, device path (incl. PCIe root)

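GPUs are enumerated in PCI BDF order, which matches the index order used by nvidia-smi and NVIDIA_VISIBLE_DEVICES. CUDA's own default device ordering can differ unless CUDA_DEVICE_ORDER=PCI_BUS_ID is set.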
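A minimal sketch of those sysfs reads follows. The function names are hypothetical and the `sysfs` parameter exists only so the code can be pointed at a test tree; the paths are the ones listed above.

```python
import os
import re

# PCI address in domain:bus:device.function form, e.g. "0000:18:00.0".
BDF_RE = re.compile(r"^[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]$")


def gpu_bdfs(sysfs="/sys"):
    """List the PCI addresses bound to the nvidia driver, in BDF order."""
    drv = os.path.join(sysfs, "bus/pci/drivers/nvidia")
    if not os.path.isdir(drv):
        return []
    # The driver directory also holds control files (bind, unbind, ...).
    return sorted(e for e in os.listdir(drv) if BDF_RE.match(e))


def rdma_bdfs(sysfs="/sys"):
    """Map each RDMA device name to the PCI address behind its device link."""
    base = os.path.join(sysfs, "class/infiniband")
    if not os.path.isdir(base):
        return {}
    result = {}
    for dev in sorted(os.listdir(base)):
        target = os.path.realpath(os.path.join(base, dev, "device"))
        result[dev] = os.path.basename(target)
    return result


def numa_node(bdf, sysfs="/sys"):
    """Read numa_node for a PCI address; -1 if unknown or unreadable."""
    path = os.path.join(sysfs, "bus/pci/devices", bdf, "numa_node")
    try:
        with open(path) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return -1
```

Sorting the fixed-width lowercase addresses as strings gives BDF order, so the position of an address in `gpu_bdfs()` can stand in for the GPU index under the ordering assumption stated above.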

Why single-NIC isn't isolated

RDMA NICs in shared mode don't support hardware-level container isolation: multiple containers can share IB ports on the same NIC. So the device plugin reports virtual slots, and the NRI plugin injects the same device nodes for each of them. This is fine for AI training/inference, where a node typically runs one or two heavy workloads, not hundreds of tiny tenants.

Project

Issues and PRs welcome. If this saves you a debugging session, a ⭐ is appreciated.


Written by

Zoe

AI Infra Engineer · LLM Serving · GPU/RDMA · indie hacker, obsessed with shipping tools
