Blink: CPU-Free LLM Inference by Delegating the Serving Stack to GPU and SmartNIC

Mohammad Siavashi; Mariano Scazzariello; Gerald Q. Maguire Jr.; Dejan Kostić; Marco Chiesa

arXiv:2604.07609 · cs.DC · April 10, 2026

TL;DR

Blink is an LLM serving architecture that removes the host CPU from the steady-state inference path: a SmartNIC handles incoming requests and a persistent GPU kernel handles batching, scheduling, and KV-cache management, which keeps performance stable under CPU interference.

Contribution

The paper proposes a CPU-free inference stack for LLMs. Request handling is offloaded to a SmartNIC, which delivers inputs directly into GPU memory via RDMA, while batching and scheduling run on the GPU, improving efficiency and robustness to CPU interference.

Findings

01 Outperforms the TensorRT-LLM, vLLM, and SGLang baselines on latency and throughput.

02 Maintains stable performance under CPU interference.

03 Reduces energy consumption per token.

Abstract

Large Language Model (LLM) inference is rapidly becoming a core datacenter service, yet current serving stacks keep the host CPU on the critical path for orchestration and token-level control. This makes LLM performance sensitive to CPU interference, undermining application colocation and forcing operators to reserve CPU headroom, leaving substantial capacity unutilized. We introduce Blink, an end-to-end serving architecture that removes the host CPU from the steady-state inference path by redistributing responsibilities across a SmartNIC and a GPU. Blink offloads request handling to the SmartNIC, which delivers inputs directly into GPU memory via RDMA, and replaces host-driven scheduling with a persistent GPU kernel that performs batching, scheduling, and KV-cache management without CPU involvement. Evaluated against TensorRT-LLM, vLLM, and SGLang, Blink outperforms all baselines…
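
The division of labor described in the abstract can be illustrated with a toy model. The sketch below is not Blink's implementation and does not use RDMA or a GPU: it is plain Python in which a fixed-size ring buffer stands in for a request queue in GPU memory, `push` plays the role of the SmartNIC writing a request, and `scheduler_step` plays one iteration of a persistent kernel's polling loop that forms a batch without any third party coordinating. All names here (`Request`, `RequestRing`, `scheduler_step`, `max_batch`) are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Request:
    req_id: int
    prompt_tokens: List[int]


class RequestRing:
    """Fixed-size single-producer/single-consumer ring buffer.

    Stands in for a request queue in GPU memory that a SmartNIC fills
    (hypothetical layout, not taken from the paper).
    """

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.slots: List[Optional[Request]] = [None] * capacity
        self.head = 0  # count of requests consumed so far
        self.tail = 0  # count of requests produced so far

    def push(self, req: Request) -> bool:
        """Producer side (the 'SmartNIC'). Returns False if the ring is full."""
        if self.tail - self.head == self.capacity:
            return False
        self.slots[self.tail % self.capacity] = req
        self.tail += 1  # publish the slot only after the payload is written
        return True

    def pop(self) -> Optional[Request]:
        """Consumer side (the 'persistent kernel'). Returns None if empty."""
        if self.head == self.tail:
            return None
        req = self.slots[self.head % self.capacity]
        self.head += 1
        return req


def scheduler_step(ring: RequestRing, max_batch: int) -> List[Request]:
    """One polling iteration: drain up to max_batch pending requests."""
    batch: List[Request] = []
    while len(batch) < max_batch:
        req = ring.pop()
        if req is None:
            break
        batch.append(req)
    return batch
```

In this model the producer and consumer communicate only through the shared slots and the two counters, which is the property that lets a host CPU stay out of the loop. A real system would additionally need memory-ordering guarantees between the NIC's writes and the kernel's reads, which this single-threaded sketch does not model.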
