TL;DR
SPECTRE is a hybrid speculative decoding framework that speeds up large-model inference by reusing underutilized tail-model services as remote drafters and running draft generation and target-side verification in parallel.
Contribution
It introduces a hybrid ordinary-parallel speculative decoding strategy guided by a throughput-derived threshold, combined with speculative priority scheduling and draft-side prompt compression, to improve throughput in multi-model LLM serving systems.
Findings
SPECTRE achieves up to 2.28× speedup over autoregressive decoding.
It provides up to 66% relative improvement over existing speculative decoding baselines.
SPECTRE causes only minor interference with tail-model workloads.
Abstract
LLM serving platforms are increasingly deployed as multi-model cloud systems, where user demand is often long-tailed: a few popular large models receive most requests, while many smaller tail models remain underutilized. We propose SPECTRE (Parallel SPECulative Decoding with a Multi-Tenant REmote Drafter), a serving framework that reuses underutilized tail-model services as remote drafters for heavily loaded large-model services through speculative decoding. SPECTRE enables draft generation and target-side verification to run in parallel, and makes such parallelism effective through three techniques: a hybrid ordinary-parallel speculative decoding strategy guided by a threshold derived from throughput analysis, speculative priority scheduling to preserve draft-target overlap under multi-tenant traffic, and draft-side prompt compression to reduce…
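For readers unfamiliar with the draft-then-verify loop that the abstract builds on, the following is a minimal sketch of ordinary (sequential) speculative decoding with greedy acceptance. It is not SPECTRE's implementation: the parallel draft/verify overlap, the threshold, the scheduling, and the prompt compression are all omitted, and every name here (`speculative_step`, `generate`, `toy_target`, `toy_draft`) is hypothetical. The toy "models" are plain functions over integer tokens, chosen only so the example runs without any ML library.

```python
from typing import Callable, List, Tuple

Token = int
NextTokenFn = Callable[[List[Token]], Token]


def speculative_step(prefix: List[Token], draft_next: NextTokenFn,
                     target_next: NextTokenFn, k: int) -> List[Token]:
    """One draft-then-verify round; returns the tokens accepted this round."""
    # Draft phase: the small model proposes k tokens autoregressively.
    draft: List[Token] = []
    for _ in range(k):
        draft.append(draft_next(prefix + draft))

    # Verify phase: compare each proposal with the target's greedy choice.
    # (A real system would score all k positions in one batched target pass;
    # this loop calls the target per position for clarity.)
    accepted: List[Token] = []
    for tok in draft:
        expected = target_next(prefix + accepted)
        if tok != expected:
            accepted.append(expected)  # target's token replaces the first mismatch
            return accepted
        accepted.append(tok)

    # Every draft token matched, so the target contributes one extra token.
    accepted.append(target_next(prefix + accepted))
    return accepted


def generate(prompt: List[Token], draft_next: NextTokenFn,
             target_next: NextTokenFn, k: int, max_new: int) -> Tuple[List[Token], int]:
    """Run verification rounds until max_new tokens exist; returns (tokens, rounds)."""
    out = list(prompt)
    rounds = 0
    while len(out) - len(prompt) < max_new:
        out.extend(speculative_step(out, draft_next, target_next, k))
        rounds += 1
    return out[:len(prompt) + max_new], rounds


# Toy stand-ins (assumptions for illustration only).
def toy_target(ctx: List[Token]) -> Token:
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[Token]) -> Token:
    # Agrees with the target except after token 4, where it guesses wrong.
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10


tokens, rounds = generate([0], toy_draft, toy_target, k=3, max_new=8)
```

With greedy acceptance the output is identical to what the target alone would produce; the gain comes from needing fewer verification rounds than generated tokens whenever the drafter is often right.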
