StarSD: One-for-Many Speculative Decoding
Junhao He, Feiran You, Hongyang Du

TL;DR
StarSD is a speculative decoding framework in which a single draft model serves multiple target models across distributed nodes, improving inference speed and accelerator utilization for large language models.
Contribution
StarSD is a one-for-many speculative decoding framework: one draft model is shared by multiple target models on distributed accelerators, arranged in a star topology, so that draft computation is shared rather than duplicated for each target.
Findings
Achieves predictable latency and high utilization in distributed inference.
Supports flexible resource allocation across heterogeneous accelerators.
Maintains output quality while accelerating autoregressive generation.
Abstract
Speculative decoding accelerates autoregressive generation by separating token proposal from verification, but most existing approaches are designed for single-node execution and do not scale well to multi-accelerator clusters used for serving modern Large Language Models (LLMs). We present StarSD, a one-for-many speculative decoding framework that uses a single draft model to serve multiple target models across distributed nodes via a star topology. StarSD decouples drafting and verification, enabling effective sharing of draft computation, and preventing distributed accelerators from remaining idle under bursty workloads. We provide a system-level analysis that characterizes when and why a single draft model can remain fully utilized by multiple verifiers, yielding predictable latency and utilization gains. Extensive experiments in real-world distributed inference settings demonstrate…
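The draft-then-verify split described in the abstract can be illustrated with a minimal sketch. This is not StarSD's implementation: the toy models, the function names, and the greedy acceptance rule are assumptions made for illustration, and the utilization formula is an idealized round-robin model rather than the paper's system-level analysis.

```python
import math
from typing import Callable, List

# Hypothetical stand-in for a model: maps a token prefix to the next token
# under greedy decoding. Real LLMs are not involved in this sketch.
Model = Callable[[List[int]], int]


def draft_tokens(draft: Model, prefix: List[int], k: int) -> List[int]:
    """Proposal step: the draft model guesses k tokens autoregressively."""
    proposed: List[int] = []
    for _ in range(k):
        proposed.append(draft(prefix + proposed))
    return proposed


def verify_tokens(target: Model, prefix: List[int], proposed: List[int]) -> List[int]:
    """Verification step (greedy variant, assumed here).

    Keeps proposals while they match the target's own choice, emits the
    target's token at the first mismatch, or one extra target token if
    every proposal was accepted. A real target would score all positions
    in one forward pass; this loop only mimics the result.
    """
    accepted: List[int] = []
    for token in proposed:
        expected = target(prefix + accepted)
        if token != expected:
            accepted.append(expected)
            return accepted
        accepted.append(token)
    accepted.append(target(prefix + accepted))
    return accepted


def toy_target(prefix: List[int]) -> int:
    """Toy target: counts upward modulo 10."""
    return (prefix[-1] + 1) % 10


def toy_draft(prefix: List[int]) -> int:
    """Toy draft: agrees with the target except right after token 3."""
    return 0 if prefix[-1] == 3 else (prefix[-1] + 1) % 10


def draft_utilization(num_targets: int, draft_time: float, verify_time: float) -> float:
    """Busy fraction of one shared drafter in an idealized round-robin star.

    Assumption: each target alternates one draft step and one verification
    round trip, so the drafter owes num_targets draft steps per cycle.
    """
    cycle = draft_time + verify_time
    return min(1.0, num_targets * draft_time / cycle)


def min_targets_for_full_utilization(draft_time: float, verify_time: float) -> int:
    """Smallest number of targets that keeps the drafter busy in this model."""
    return math.ceil((draft_time + verify_time) / draft_time)
```

With these toy models, `verify_tokens(toy_target, [1], draft_tokens(toy_draft, [1], 4))` returns `[2, 3, 4]`, which matches what the target alone would generate greedily. Under the assumed timing model, a drafter that needs 10 ms per step against a 30 ms verification round trip is busy 25% of the time with one target and reaches full utilization with four.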
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Advanced Neural Network Applications · Scientific Computing and Data Management
