SIMPLE: Disaggregating Sampling from GPU Inference into a Decision Plane for Faster Distributed LLM Serving
Bohan Zhao, Zane Cao, Yongchao He

TL;DR
SIMPLE moves sampling out of GPU inference into a CPU-side service. This removes the sampling bottleneck in distributed LLM serving, raising throughput and cutting latency without any user-side code changes.
Contribution
The paper introduces a stage-agnostic, sequence-parallel, overlappable decision plane that disaggregates sampling from GPU inference, combining a CPU-based sampling algorithm with hot-vocab sampling for improved performance.
Findings
Up to 96% increase in end-to-end throughput
P95 latency reduced by 20-65%
No user-side code changes required
Abstract
As large language models (LLMs) scale out with tensor parallelism (TP) and pipeline parallelism (PP) and production stacks have aggressively optimized the data plane (attention/GEMM and KV cache), sampling, the decision plane that turns logits into tokens, becomes a new bottleneck. This creates a structural holdout: sampling neither expands with TP nor balances across PP stages, so its share of iteration time grows as GPUs get faster and it caps pipeline frequency at the last stage. We present SIMPLE, a stage-agnostic, sequence-parallel, overlappable decision plane that disaggregates sampling into a CPU-side service and shrinks its runtime footprint back to a minor, hidden role. SIMPLE combines: (1) sequence-parallel sampling, which shards work along the batch dimension and removes vocabulary-axis collectives; (2) a CPU-based algorithm with column-wise penalties and truncation-first…
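To illustrate point (1) of the abstract, here is a minimal sketch of sampling sharded along the batch dimension, so that each shard is sampled without any communication across the vocabulary axis. It is not the paper's implementation. The function names (`sample_row`, `sample_shard`, `sequence_parallel_sample`), the top-k rule standing in for "truncation-first", and the use of Python threads as stand-in workers are all assumptions made for the example.

```python
import math
import random
from concurrent.futures import ThreadPoolExecutor


def sample_row(logits, top_k, temperature, rng):
    """Sample one token id from a single sequence's logits.

    Assumed reading of "truncation-first": keep the top_k logits, then
    run softmax only over those survivors instead of the full vocabulary.
    """
    # Indices of the top_k largest logits (the truncated candidate set).
    candidates = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    scaled = [logits[i] / temperature for i in candidates]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(candidates, weights=weights, k=1)[0]


def sample_shard(shard, top_k, temperature, seed):
    """Sample every row of one batch shard; no data from other shards is needed."""
    rng = random.Random(seed)
    return [sample_row(row, top_k, temperature, rng) for row in shard]


def sequence_parallel_sample(batch_logits, num_workers=2, top_k=3, temperature=1.0, seed=0):
    """Split the batch into contiguous shards and sample each shard independently."""
    n = len(batch_logits)
    size = max(1, math.ceil(n / num_workers))
    shards = [batch_logits[i:i + size] for i in range(0, n, size)]
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        futures = [
            pool.submit(sample_shard, shard, top_k, temperature, seed + idx)
            for idx, shard in enumerate(shards)
        ]
        results = [f.result() for f in futures]
    # Concatenate in shard order so each output position matches its input row.
    return [tok for shard_tokens in results for tok in shard_tokens]
```

Each row's token depends only on that row's logits, so the shards never exchange data. Threads here only show the structure; pure-Python threads do not run this work in parallel, and a real CPU-side service would use native workers.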
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Natural Language Processing Techniques · Advanced Neural Network Applications
