SPECTRE: Hybrid Ordinary-Parallel Speculative Serving for Resource-Efficient LLM Inference

Jincheng Xie; Yawen Ling; Qi Xiao; Feiyu Zhang; Zhongyi Huang; Wen Hu; Yu Zheng

arXiv:2605.08151 · cs.DC · May 13, 2026

TL;DR

SPECTRE is a hybrid speculative decoding serving framework that speeds up large-model inference by reusing underutilized tail-model services as remote drafters, with draft generation and target-side verification running in parallel.

Contribution

It introduces a hybrid ordinary-parallel speculative decoding strategy guided by a threshold derived from throughput analysis, combined with speculative priority scheduling and draft-side prompt compression, to improve throughput in multi-model LLM serving systems.

Findings

01 SPECTRE achieves up to 2.28× speedup over autoregressive decoding.

02 It provides up to a 66% relative improvement over existing speculative decoding baselines.

03 SPECTRE causes only minor interference with tail-model workloads.

Abstract

LLM serving platforms are increasingly deployed as multi-model cloud systems, where user demand is often long-tailed: a few popular large models receive most requests, while many smaller tail models remain underutilized. We propose SPECTRE (Parallel SPECulative Decoding with a Multi-Tenant REmote Drafter), a serving framework that reuses underutilized tail-model services as remote drafters for heavily loaded large-model services through speculative decoding. SPECTRE enables draft generation and target-side verification to run in parallel, and makes such parallelism effective through three techniques: a hybrid ordinary-parallel speculative decoding strategy guided by a threshold derived from throughput analysis, speculative priority scheduling to preserve draft-target overlap under multi-tenant traffic, and draft-side prompt compression to reduce…
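For readers unfamiliar with the underlying mechanism, the following is a minimal sketch of ordinary (sequential) greedy speculative decoding, the baseline mode that the abstract contrasts with parallel drafting. It is not SPECTRE's implementation: the names `draft_tokens`, `verify`, and `speculative_decode`, and the toy `target` and `draft` functions, are hypothetical stand-ins, and the sketch omits the paper's parallel overlap, threshold rule, priority scheduling, and prompt compression.

```python
from typing import Callable, List

Token = int
# Hypothetical stand-in for a model: maps a token context to its greedy next token.
NextToken = Callable[[List[Token]], Token]


def draft_tokens(draft: NextToken, context: List[Token], k: int) -> List[Token]:
    """Let the small draft model propose k tokens autoregressively."""
    proposed: List[Token] = []
    for _ in range(k):
        proposed.append(draft(context + proposed))
    return proposed


def verify(target: NextToken, context: List[Token], proposed: List[Token]) -> List[Token]:
    """Accept the longest prefix the target agrees with, plus one target token.

    A real system scores all positions in one batched forward pass; this toy
    version calls the target once per position for clarity.
    """
    accepted: List[Token] = []
    for tok in proposed:
        expected = target(context + accepted)
        if tok != expected:
            accepted.append(expected)  # correction token from the target
            return accepted
        accepted.append(tok)
    accepted.append(target(context + accepted))  # bonus token when all drafts pass
    return accepted


def speculative_decode(target: NextToken, draft: NextToken,
                       prompt: List[Token], k: int, max_new: int) -> List[Token]:
    """Alternate drafting and verification until max_new tokens are produced."""
    out = list(prompt)
    while len(out) - len(prompt) < max_new:
        out += verify(target, out, draft_tokens(draft, out, k))
    return out[: len(prompt) + max_new]


# Toy models (assumptions for illustration): the target counts upward mod 10,
# and the draft agrees except when the context length is a multiple of 4.
def target(ctx: List[Token]) -> Token:
    return (ctx[-1] + 1) % 10


def draft(ctx: List[Token]) -> Token:
    return (ctx[-1] + 1) % 10 if len(ctx) % 4 else 0


def greedy_decode(prompt: List[Token], n: int) -> List[Token]:
    """Plain autoregressive decoding with the target only, for comparison."""
    out = list(prompt)
    for _ in range(n):
        out.append(target(out))
    return out
```

With greedy verification, every emitted token is one the target itself would have chosen, so `speculative_decode(target, draft, [3], 3, 12)` produces the same sequence as `greedy_decode([3], 12)`; the draft model only affects how many target steps are needed, not the output.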

Code & Models

Repositories

sgl-project/sglang/pull/22272
github
