Running LLM inference on K8s: auto-injecting RDMA permissions and GPU-NIC topology affinity
If you're running distributed LLM inference or training on Kubernetes, RDMA networking is one of those problems everybody underestimates. This post introduces k8s-rdma-device-plugin, an open-source project that solves two of the biggest pain points: auto-injecting RDMA device permissions into containers, and matching the right RDMA NIC to the right GPUs via PCIe topology.
Why RDMA matters more than ever in AI inference
As LLM serving moves toward multi-GPU TP (Tensor Parallelism), multi-node PP (Pipeline Parallelism), and P/D disaggregation (Prefill–Decode split), GPU-to-GPU bandwidth becomes the bottleneck.
- vLLM is heavily pushing Large Scale Serving in its 2026 Q1 roadmap — GB200, Wide EP, P/D disagg.
- SGLang's HiCache + Mooncake moves KV cache over RDMA to enable P/D disagg.
- DeepSpeed / Megatron-LM depend on GPUDirect RDMA for gradient sync during training.
NCCL automatically uses RDMA when it detects an InfiniBand/RoCE NIC. Throughput jumps from ~10 Gbps over TCP to 200–400 Gbps over IB. But only if the container can actually access /dev/infiniband/*.
The three pain points of managing RDMA on K8s
1. The container has no device permissions
By default, K8s pods can't touch the host's /dev/infiniband/ device nodes. NCCL silently falls back to TCP. Throughput tanks but nothing errors out — which is the worst kind of failure, because nobody knows they're running on TCP.
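Because the fallback is silent, a startup check inside the container is cheap insurance. The following is a minimal sketch, not part of the project: the helper name `visible_uverbs` is hypothetical, and it assumes that a container with working RDMA access sees at least one `uverbs*` node under /dev/infiniband.

```python
import glob
import os


def visible_uverbs(dev_dir="/dev/infiniband"):
    """Return the sorted uverbs device nodes visible under dev_dir.

    An empty list suggests the container has no RDMA device access,
    in which case NCCL would likely fall back to TCP.
    """
    return sorted(glob.glob(os.path.join(dev_dir, "uverbs*")))
```

Calling `visible_uverbs()` from an entrypoint script and logging a warning when the list is empty turns a silent slowdown into something you can actually see in the logs.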
The common "fixes" are ugly:
# ❌ Option 1: privileged mode (too dangerous)
securityContext:
  privileged: true
# ❌ Option 2: manual hostPath (fragile, non-portable)
volumes:
  - name: infiniband
    hostPath:
      path: /dev/infiniband
Privileged mode hands the container full host access. HostPath mounts break the moment node hardware differs.
2. Picking the wrong RDMA NIC
A typical 8-GPU box has multiple RDMA NICs (e.g. 4× ConnectX-7), each physically wired to specific GPUs via a PCIe switch. If GPU 0's traffic ends up on the NIC closest to GPU 7, the packets cross NUMA boundaries — sometimes even the PCIe root complex — and latency spikes.
The manual fix is NCCL_IB_HCA=mlx5_0, but that requires knowing each node's PCIe topology by hand. It does not scale.
3. No resource accounting
The K8s scheduler has no idea whether a node has RDMA NICs, how many, or how much capacity. In multi-tenant clusters, that means no real isolation or admission control.
Existing options and why they fall short
| Solution | Resource accounting | Device permission injection | GPU-NIC affinity | No privileged |
|---|---|---|---|---|
| Mellanox k8s-rdma-shared-dev-plugin | ✅ | ❌ | ❌ | ❌ |
| SR-IOV network device plugin | ✅ (VFs) | ❌ | ❌ | partial |
| Manual hostPath + privileged | ❌ | ✅ (sledgehammer) | ❌ | ❌ |
Mellanox's shared-plugin is the closest existing solution but only counts resources — it doesn't inject permissions. You get rdma/hca_shared_devices: 1 but no /dev/infiniband/uverbs0 inside the container.
The SR-IOV plugin targets network virtualization (VF allocation), needs Multus CNI, and is overkill for pure RDMA workloads.
The solution: k8s-rdma-device-plugin
I built one project that combines all three capabilities:
┌──────────────────────────────────────────────────────┐
│ k8s-rdma-device-plugin (DaemonSet) │
│ │
│ ┌─────────────────┐ ┌────────────────────────────┐ │
│ │ Device Plugin │ │ NRI Plugin │ │
│ │ │ │ │ │
│ │ Reports virtual │ │ • Auto-injects RDMA │ │
│ │ RDMA resources │ │ device permissions │ │
│ │ to kubelet │ │ • Annotation-level │ │
│ │ (toggleable) │ │ fine-grained control │ │
│ │ │ │ • Auto GPU-NIC affinity │ │
│ └─────────────────┘ └────────────────────────────┘ │
└──────────────────────────────────────────────────────┘
Capability 1: device plugin resource reporting
Registers a virtual RDMA resource (default rdma.io/hca, count configurable) with kubelet via the standard Device Plugin Framework. Lets the scheduler reason about RDMA.
A single RDMA NIC has no hardware-level isolation, so the "virtual resource" is fungible — each allocation just says "this pod needs RDMA access," not "this pod owns NIC #3."
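To make "fungible" concrete, here is a toy model in Python. It is an illustration only: the ID format, the function names, and the `allocate` behavior are assumptions for this sketch, not the plugin's actual code.

```python
def virtual_device_ids(count, prefix="rdma-hca"):
    """Build `count` interchangeable slot IDs to advertise to kubelet."""
    if count < 0:
        raise ValueError("count must be non-negative")
    return [f"{prefix}-{i}" for i in range(count)]


def allocate(requested_ids, device_nodes):
    """Return the device nodes for an allocation.

    The requested IDs are deliberately ignored: every slot maps to the
    same set of RDMA device nodes, which is what makes the slots fungible.
    """
    return list(device_nodes)
```

Whichever slot the kubelet hands out, the container ends up with the same device nodes. The count only limits how many pods can be admitted per node.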
Set --enable-device-plugin=false if you only want the NRI injection.
Capability 2: NRI device permission injection
Uses the containerd NRI (Node Resource Interface) hook to inject RDMA device nodes at container creation:
- Global: /dev/infiniband/rdma_cm
- Per-NIC: /dev/infiniband/uverbs*, /dev/infiniband/umad*, /dev/infiniband/issm*
NRI is a containerd-native plugin mechanism. No CNI changes, no Multus, no privileged containers.
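Injecting a device node boils down to describing it by path, type, and major/minor numbers. The sketch below shows one way to derive that description from the host's /dev/infiniband with the Python standard library. The function names and the dict layout are assumptions chosen for illustration; they are not the plugin's data model.

```python
import os
import stat


def device_spec(path, st_mode, st_rdev):
    """Describe a device node as path/type/major/minor, or None if not a device."""
    if stat.S_ISCHR(st_mode):
        dev_type = "c"
    elif stat.S_ISBLK(st_mode):
        dev_type = "b"
    else:
        return None
    return {
        "path": path,
        "type": dev_type,
        "major": os.major(st_rdev),
        "minor": os.minor(st_rdev),
    }


def scan_devices(dev_dir="/dev/infiniband"):
    """Collect specs for every device node directly under dev_dir."""
    specs = []
    if not os.path.isdir(dev_dir):
        return specs
    for name in sorted(os.listdir(dev_dir)):
        path = os.path.join(dev_dir, name)
        st = os.stat(path)
        spec = device_spec(path, st.st_mode, st.st_rdev)
        if spec is not None:
            specs.append(spec)
    return specs
```

On a host with RDMA NICs, `scan_devices()` should list rdma_cm plus the per-NIC nodes. On a machine without them it simply returns an empty list.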
It also supports fine-grained control through annotations:
metadata:
  annotations:
    # Inject for all containers in the pod
    devices.nri.io/pod: |
      - path: /dev/infiniband/uverbs0
        type: c
        major: 231
        minor: 0
    # Inject for a specific container
    devices.nri.io/container.myapp: |
      - path: /dev/infiniband/uverbs1
        type: c
        major: 231
        minor: 1
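As a rough illustration of the two annotation scopes, this Python sketch collects the raw annotation values that apply to a given container. The helper name is hypothetical, and the sketch makes no claim about how the plugin merges or prioritizes the two. The YAML payload is left unparsed because the standard library has no YAML parser.

```python
POD_KEY = "devices.nri.io/pod"
CONTAINER_PREFIX = "devices.nri.io/container."


def device_annotations(annotations, container_name):
    """Return the raw annotation values that apply to container_name.

    Pod-wide value first (if present), then the container-specific one.
    """
    values = []
    if POD_KEY in annotations:
        values.append(annotations[POD_KEY])
    key = CONTAINER_PREFIX + container_name
    if key in annotations:
        values.append(annotations[key])
    return values
```

With the annotations above, a container named `myapp` would pick up both entries, while any other container in the pod would only see the pod-wide one.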
Capability 3: GPU-NIC PCIe topology affinity
This is the headline feature. With gpuRdmaAutoInject enabled:
- Detect the container's NVIDIA_VISIBLE_DEVICES to learn which GPUs it owns
- Enumerate GPU PCI BDFs from /sys/bus/pci/drivers/nvidia/
- Enumerate RDMA device PCI BDFs from /sys/class/infiniband/
- Match GPUs to NICs by PCIe topology:
  - Same PCIe root complex (best: same switch, lowest latency)
  - Same NUMA node (fallback: same memory domain)
- Inject the matched NIC into the container
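The two-tier matching rule can be sketched in a few lines of Python. This is a simplified model under stated assumptions: `PciDevice` and `match_nics` are hypothetical names, each device is reduced to a root identifier and a NUMA node, and the tie-break among several NICs under the same root is not modeled, so the sketch returns every NIC in the best available tier.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PciDevice:
    name: str       # e.g. "gpu0" or "mlx5_0"
    root: str       # PCIe root identifier, e.g. "pci0000:00"
    numa_node: int  # NUMA node from sysfs, -1 if unknown


def match_nics(gpus, nics):
    """Pick NICs for the given GPUs: same PCIe root first, then same NUMA node."""
    chosen = []
    for gpu in gpus:
        same_root = [n for n in nics if n.root == gpu.root]
        same_numa = [
            n for n in nics
            if n.numa_node >= 0 and n.numa_node == gpu.numa_node
        ]
        # Prefer the same-root tier; fall back to same-NUMA only if it is empty.
        for nic in (same_root or same_numa):
            if nic not in chosen:
                chosen.append(nic)
    return [n.name for n in chosen]
```

For example, a GPU on root `pci0000:00` with NICs on `pci0000:00` and `pci0000:80` gets only the first NIC. If no NIC shares its root, the NUMA node decides instead.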
Visualize it on a typical 8×H100 + 4×CX-7 machine:
PCIe Root 0 PCIe Root 1
├── GPU 0 ├── GPU 4
├── GPU 1 ├── GPU 5
├── mlx5_0 ◄──┐ ├── mlx5_2 ◄──┐
├── GPU 2 │ ├── GPU 6 │
├── GPU 3 │ ├── GPU 7 │
├── mlx5_1 │ └── mlx5_3 │
│ │
Pod asks for GPU 0,1 → mlx5_0 ←──┘
Pod asks for GPU 4,5 ─────────────── → mlx5_2
Key point: this works for any GPU container with NVIDIA_VISIBLE_DEVICES set. The pod doesn't need to explicitly request rdma.io/hca. Pure GPU inference pods get the right NIC automatically.
Real-world usage
vLLM multi-GPU inference
apiVersion: v1
kind: Pod
spec:
  containers:
    - name: vllm
      image: vllm/vllm-openai:latest
      args: ["--model", "deepseek-ai/DeepSeek-V3", "--tensor-parallel-size", "8"]
      env:
        - name: NVIDIA_VISIBLE_DEVICES
          value: "0,1,2,3,4,5,6,7"
      resources:
        limits:
          nvidia.com/gpu: "8"
      # RDMA devices auto-injected based on PCIe topology!
SGLang P/D disaggregation
No need to manually pass --disaggregation-ib-device mlx5_1 — the right RDMA NIC is already in the container.
apiVersion: v1
kind: Pod
spec:
  containers:
    - name: sglang-prefill
      image: lmsysorg/sglang:latest
      command: ["python", "-m", "sglang.launch_server"]
      args:
        - "--model-path"
        - "meta-llama/Llama-3-70B"
        - "--tp"
        - "4"
        - "--disaggregation-mode"
        - "prefill"
      env:
        - name: NVIDIA_VISIBLE_DEVICES
          value: "0,1,2,3"
      resources:
        limits:
          nvidia.com/gpu: "4"
Inject-only mode (no kubelet accounting)
If you just want auto-injection without resource accounting:
helm install rdma-device-plugin ./deploy/charts \
--namespace kube-system \
--set rdma.enableDevicePlugin=false \
--set gpuRdmaAutoInject=true
Deployment
Prereqs
- Kubernetes 1.26+
- containerd with NRI enabled (enable_nri = true)
- Mellanox/NVIDIA ConnectX-series RDMA NICs
- NVIDIA GPU drivers (for affinity mode)
Helm install
helm install rdma-device-plugin ./deploy/charts \
--namespace kube-system \
--set rdma.resourceName="rdma.io/hca" \
--set rdma.resourceCount=100 \
--set gpuRdmaAutoInject=true
Configuration
Precedence: CLI flags > env vars > config file > defaults
| Env var | Description | Default |
|---|---|---|
| RDMA_ENABLE_DEVICE_PLUGIN | Enable device plugin | true |
| RDMA_RESOURCE_NAME | Resource name | rdma.io/hca |
| RDMA_RESOURCE_COUNT | Virtual resource count | 100 |
| RDMA_GPU_AUTO_INJECT | GPU-NIC auto-inject | false |
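The precedence rule is easy to state and easy to get subtly wrong, so here is a small Python sketch of it. It is a simplification: all three sources are keyed by the env var names from the table, whereas real CLI flags and config-file keys have their own names. The `resolve` helper is hypothetical.

```python
import os

DEFAULTS = {
    "RDMA_ENABLE_DEVICE_PLUGIN": "true",
    "RDMA_RESOURCE_NAME": "rdma.io/hca",
    "RDMA_RESOURCE_COUNT": "100",
    "RDMA_GPU_AUTO_INJECT": "false",
}


def resolve(key, cli=None, env=None, config_file=None):
    """Return the value for key using CLI > env > config file > default."""
    env = os.environ if env is None else env
    # Check the sources from highest to lowest precedence.
    for source in (cli or {}, env, config_file or {}):
        if key in source:
            return source[key]
    return DEFAULTS[key]
```

So a value in the config file only takes effect when neither a flag nor an env var sets the same option.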
Implementation notes
Why NRI over an Admission Webhook
Admission webhooks can only mutate the Pod spec (add volumes / mounts). They can't directly operate on the container's Linux device cgroup. NRI runs inside containerd's container lifecycle hooks, where we can inject a LinuxDevice with precise major/minor and permissions.
PCIe topology discovery via sysfs
We read topology purely from sysfs — no user-space tools required:
/sys/bus/pci/drivers/nvidia/ → GPU PCI BDF list
/sys/class/infiniband/<dev>/device → RDMA device PCI BDF (symlink)
/sys/bus/pci/devices/<BDF>/ → numa_node, device path (incl. PCIe root)
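Reading those three pieces of information takes only the standard library. The following Python sketch shows the idea. The helper names are hypothetical, and treating the first `pciDDDD:BB` component of the resolved path as the PCIe root is an assumption made for illustration.

```python
import os


def pci_bdf(device_link):
    """Resolve a sysfs `device` symlink and return the PCI BDF it points at."""
    return os.path.basename(os.path.realpath(device_link))


def numa_node(pci_device_dir):
    """Read numa_node for a PCI device; -1 means unknown."""
    try:
        with open(os.path.join(pci_device_dir, "numa_node")) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return -1


def pci_root(pci_device_dir):
    """Return the first `pciDDDD:BB` component of the resolved device path."""
    for part in os.path.realpath(pci_device_dir).split(os.sep):
        if part.startswith("pci") and ":" in part:
            return part
    return None
```

For an RDMA device you would call `pci_bdf("/sys/class/infiniband/mlx5_0/device")` and then pass `/sys/bus/pci/devices/<BDF>` to the other two helpers.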
GPUs are enumerated by PCI BDF order, matching NVIDIA's GPU index assignment.
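Under that ordering assumption, turning NVIDIA_VISIBLE_DEVICES into a set of BDFs is a sort plus an index lookup. This Python sketch handles only index-style values and `all`; UUID-style values are skipped, and the function names are hypothetical.

```python
def bdf_key(bdf):
    """Turn 'DDDD:BB:DD.F' into a tuple of ints for numeric sorting."""
    domain, bus, dev_fn = bdf.split(":")
    dev, fn = dev_fn.split(".")
    return (int(domain, 16), int(bus, 16), int(dev, 16), int(fn, 16))


def gpus_for_visible_devices(visible, gpu_bdfs):
    """Map an index-style NVIDIA_VISIBLE_DEVICES value to GPU BDFs."""
    ordered = sorted(gpu_bdfs, key=bdf_key)
    if visible.strip() == "all":
        return ordered
    result = []
    for token in visible.split(","):
        token = token.strip()
        # Non-numeric tokens (e.g. UUIDs) and out-of-range indices are skipped.
        if token.isdigit() and int(token) < len(ordered):
            result.append(ordered[int(token)])
    return result
```

Sorting on parsed integers rather than raw strings keeps the order correct even if the BDF strings differ in letter case.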
Why single-NIC isn't isolated
RDMA NICs in shared mode don't support hardware-level container isolation: multiple containers can share IB ports on the same NIC. So the device plugin reports virtual slots, and NRI injects the same device nodes regardless of which slot a pod gets. This is fine for AI training/inference, where a node typically runs one or two heavy workloads, not hundreds of tiny tenants.
Project
- GitHub: https://github.com/jiusanzhou/k8s-rdma-device-plugin
- License: Apache 2.0
Issues and PRs welcome. If this saves you a debugging session, a ⭐ is appreciated.

Written by
Zoe
AI Infra Engineer · LLM Serving · GPU/RDMA · indie hacker, obsessed with shipping tools