SIMPLE: Disaggregating Sampling from GPU Inference into a Decision Plane for Faster Distributed LLM Serving

Bohan Zhao; Zane Cao; Yongchao He

arXiv:2512.00719 · cs.DC · December 2, 2025

Open Access

TL;DR

SIMPLE disaggregates sampling from GPU inference into a CPU-side service. This removes the sampling bottleneck in distributed LLM serving, raising throughput and lowering latency without requiring user code changes.

Contribution

The paper introduces a stage-agnostic, sequence-parallel, overlappable decision plane that disaggregates sampling from GPU inference, combining novel CPU-side sampling algorithms with hot-vocab sampling.

Findings

01. Up to 96% increase in end-to-end throughput

02. P95 latency reduced by 20-65%

03. No user-side code changes required

Abstract

As large language models (LLMs) scale out with tensor parallelism (TP) and pipeline parallelism (PP) and production stacks have aggressively optimized the data plane (attention/GEMM and KV cache), sampling, the decision plane that turns logits into tokens, becomes a new bottleneck. This creates a structural holdout: sampling neither expands with TP nor balances across PP stages, so its share of iteration time grows as GPUs get faster and it caps pipeline frequency at the last stage. We present SIMPLE, a stage-agnostic, sequence-parallel, overlappable decision plane that disaggregates sampling into a CPU-side service and shrinks its runtime footprint back to a minor, hidden role. SIMPLE combines: (1) sequence-parallel sampling, which shards work along the batch dimension and removes vocabulary-axis collectives; (2) a CPU-based algorithm with column-wise penalties and truncation-first…
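
The batch-dimension sharding described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`sample_row`, `sample_shard`, `sequence_parallel_sample`), the per-row top-k filter, the per-row seeding, and the use of a thread pool are all assumptions made for illustration. The only idea it is meant to show is that each worker samples complete rows on its own, so no worker ever needs a reduction across the vocabulary axis.

```python
import math
import random
from concurrent.futures import ThreadPoolExecutor


def sample_row(logits, top_k, temperature, seed):
    """Sample one token id from a single sequence's logits (hypothetical helper)."""
    # Keep only the top_k largest logits so the softmax below touches
    # k entries instead of the full vocabulary. Top-k is an illustrative
    # filter chosen for this sketch, not taken from the paper.
    top = sorted(range(len(logits)), key=logits.__getitem__, reverse=True)[:top_k]
    scaled = [logits[i] / temperature for i in top]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scaled]
    # A per-row RNG makes the result independent of how rows are sharded.
    rng = random.Random(seed)
    return rng.choices(top, weights=weights, k=1)[0]


def sample_shard(rows, top_k, temperature):
    """Sample every (row_index, logits) pair assigned to one worker."""
    return [(idx, sample_row(logits, top_k, temperature, seed=idx))
            for idx, logits in rows]


def sequence_parallel_sample(batch_logits, num_workers=2, top_k=2, temperature=1.0):
    """Shard the batch by row, sample each shard independently, restore order."""
    indexed = list(enumerate(batch_logits))
    # Round-robin split along the batch dimension; each row stays whole.
    shards = [indexed[w::num_workers] for w in range(num_workers)]
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        parts = list(pool.map(lambda s: sample_shard(s, top_k, temperature), shards))
    tokens = [None] * len(batch_logits)
    for part in parts:
        for idx, tok in part:
            tokens[idx] = tok
    return tokens


batch = [[0.1, 2.0, -1.0], [3.0, 0.0, 1.0], [0.0, 0.0, 5.0]]
greedy = sequence_parallel_sample(batch, num_workers=2, top_k=1)
```

With `top_k=1` each row reduces to its argmax, which makes the sketch easy to check by hand. Python threads are used here only to show the data flow; a CPU-side service aiming for real parallelism would more plausibly rely on processes or native code.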

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Natural Language Processing Techniques · Advanced Neural Network Applications