StreamServe: Adaptive Speculative Flows for Low-Latency Disaggregated LLM Serving
Satyam Kumar, Arpit Singh Gautam, Kailash Talreja, Saurabh Jha

TL;DR
StreamServe is a disaggregated LLM serving architecture that balances throughput and latency through metric-aware routing and online tuning of speculation depth, reporting 11x to 18x lower latency than tensor-parallel vLLM baselines.
Contribution
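It introduces an architecture that combines adaptive routing with speculative decoding for low-latency, high-throughput LLM serving. Four components divide the work: StreamScheduler (request orchestration), FlowGuard (multi-signal routing), PipeServe Engine (disaggregated prefill-decode execution on multiple GPUs), and SpecuStream (runtime-adaptive speculation).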
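To make the routing idea concrete, here is a minimal sketch of multi-signal lane selection. It is not FlowGuard's actual algorithm: the summary does not say which signals or weights are used, so `LaneMetrics`, its three fields, `lane_score`, and the default weights are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass


@dataclass
class LaneMetrics:
    """Hypothetical per-lane runtime signals (not taken from the paper)."""
    name: str
    queue_depth: int        # requests currently waiting on this lane
    gpu_utilization: float  # fraction in [0.0, 1.0]
    kv_cache_usage: float   # fraction in [0.0, 1.0]


def lane_score(m: LaneMetrics, w_queue: float = 1.0,
               w_util: float = 2.0, w_cache: float = 2.0) -> float:
    """Combine several signals into one load score; lower means less loaded.

    The weights are assumed values, chosen only to show the weighted-sum idea.
    """
    return (w_queue * m.queue_depth
            + w_util * m.gpu_utilization
            + w_cache * m.kv_cache_usage)


def pick_lane(lanes: list[LaneMetrics]) -> LaneMetrics:
    """Route the next request to the lane with the lowest combined score."""
    if not lanes:
        raise ValueError("no lanes available")
    return min(lanes, key=lane_score)


# Two stream pairs, mirroring the two-pair setup described in the abstract.
lanes = [
    LaneMetrics("pair0", queue_depth=4, gpu_utilization=0.9, kv_cache_usage=0.8),
    LaneMetrics("pair1", queue_depth=1, gpu_utilization=0.5, kv_cache_usage=0.3),
]
chosen = pick_lane(lanes)
```

A weighted sum is only one way to combine signals; a real router might also apply thresholds or hysteresis to avoid flapping between lanes.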
Findings
Reduces latency by a factor of 11 to 18 relative to tensor-parallel vLLM baselines.
Achieves up to 2235 tokens/sec throughput on summarization tasks.
Maintains stable time per output token across configurations.
Abstract
Efficient LLM serving must balance throughput and latency across diverse, bursty workloads. We introduce StreamServe, a disaggregated prefill-decode serving architecture that combines metric-aware routing across compute lanes with adaptive speculative decoding that tunes speculation depth online from runtime signals. StreamServe comprises four components: StreamScheduler for request orchestration, FlowGuard for multi-signal routing, PipeServe Engine for disaggregated prefill-decode execution on multiple GPUs, and SpecuStream for runtime-adaptive speculation. We evaluate StreamServe on four benchmarks (ALPACA, GSM8K, HUMANEVAL, and SUM) with 80 queries each (320 total), using 4 A800 40GB GPUs configured as two stream pairs. Across these workloads, StreamServe reduces latency by a factor of 11 to 18 relative to tensor-parallel vLLM baselines and reaches throughput up to 2235 tokens per second on…
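As an illustration of tuning speculation depth online, the following sketch adjusts the number of draft tokens from a smoothed acceptance rate. It is not SpecuStream's actual policy: the abstract says only that depth is tuned "from runtime signals", so the choice of acceptance rate as the signal, the class name `SpeculationDepthController`, and every threshold below are assumptions.

```python
class SpeculationDepthController:
    """Hypothetical controller: raise depth when drafts are mostly accepted,
    lower it when they are mostly rejected."""

    def __init__(self, depth: int = 4, min_depth: int = 1, max_depth: int = 8,
                 raise_above: float = 0.8, lower_below: float = 0.4,
                 smoothing: float = 0.5):
        self.depth = depth
        self.min_depth = min_depth
        self.max_depth = max_depth
        self.raise_above = raise_above  # assumed threshold
        self.lower_below = lower_below  # assumed threshold
        self.smoothing = smoothing      # weight given to the newest sample
        self.acceptance_ema = None      # exponential moving average

    def update(self, proposed: int, accepted: int) -> int:
        """Record one verification step and return the depth for the next step."""
        if proposed <= 0 or not 0 <= accepted <= proposed:
            raise ValueError("need 0 <= accepted <= proposed and proposed > 0")
        rate = accepted / proposed
        if self.acceptance_ema is None:
            self.acceptance_ema = rate
        else:
            s = self.smoothing
            self.acceptance_ema = s * rate + (1 - s) * self.acceptance_ema
        if self.acceptance_ema > self.raise_above:
            self.depth = min(self.depth + 1, self.max_depth)
        elif self.acceptance_ema < self.lower_below:
            self.depth = max(self.depth - 1, self.min_depth)
        return self.depth


ctrl = SpeculationDepthController(depth=4)
history = [
    ctrl.update(proposed=4, accepted=4),  # all drafts accepted: go deeper
    ctrl.update(proposed=5, accepted=5),  # still accepted: go deeper again
    ctrl.update(proposed=6, accepted=0),  # one bad step: average is between thresholds
    ctrl.update(proposed=6, accepted=0),  # sustained rejection: back off
]
```

The moving average keeps a single bad verification step from immediately shrinking the depth, which is one plausible way to cope with the bursty workloads the abstract mentions.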
