ECHO: Elastic Speculative Decoding with Sparse Gating for High-Concurrency Scenarios
Xinyi Hu, Yuhao Shen, Baolin Zhang, Hengxin Zhang, Jun Dai, Shuang Ge, Lei Chen, Yue Li, Mingcheng Wan

TL;DR
ECHO is a speculative decoding framework for high-concurrency LLM serving. It combines sparse confidence gating with elastic budget scheduling and reports up to 5.35x walltime speedup over state-of-the-art methods.
Contribution
ECHO introduces an elastic speculative decoding approach with sparse confidence gating. It treats speculative execution as a budgeted scheduling problem and shifts the draft budget between tree depth and tree width in high-concurrency environments.
Findings
ECHO achieves up to 5.35x walltime speedup over state-of-the-art methods.
ECHO outperforms existing methods across diverse model scales, especially on industrial-grade models.
ECHO effectively manages the verification and execution trade-off, improving efficiency in high-load scenarios.
Abstract
Speculative Decoding promises to accelerate the inference of Large Language Models, yet its efficacy often degrades in production-grade serving. Existing evaluations typically overlook the compute-bound nature of high-concurrency regimes, where verification compute becomes the dominant bottleneck. Consequently, prior methods face a dilemma: static trees incur massive verification waste, while dynamic trees suffer from cumulative misjudgments and kernel incompatibility. To bridge this gap, we introduce ECHO, a high-concurrency-oriented framework integrated into SGLang that reformulates speculative execution as a budgeted scheduling problem. Crucially, ECHO employs sparse confidence gating to manage the batch as a unified super-tree, elastically pivoting budget between depth and width to co-optimize the trade-off between reducing global verification steps and maximizing per-step…
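
To make the idea of "budgeted scheduling over a batch-wide super-tree" concrete, here is a minimal toy sketch in Python. It is not ECHO's actual algorithm, whose gating rule and scheduler are not specified in this summary. It assumes a hypothetical representation where each request's draft tree is a nested list of (confidence, children) pairs, a hypothetical global verification `budget` counted in draft tokens, and a hypothetical `gate` threshold on cumulative path confidence. A single priority queue shared by all requests then decides where the budget goes.

```python
import heapq
from itertools import count


def select_draft_nodes(batch_trees, budget, gate=0.2):
    """Pick at most `budget` draft nodes across the whole batch.

    batch_trees: {request_id: [(confidence, [children...]), ...]}
        Hypothetical nested-tuple draft trees, one per request.
    budget: total number of draft tokens the verifier may check this step.
    gate: nodes whose cumulative path confidence falls below this are pruned.

    Returns {request_id: [(depth, path_confidence), ...]} in selection order.
    """
    tie = count()  # unique tie-breaker so the heap never compares child lists
    heap = []
    for rid, candidates in batch_trees.items():
        for conf, children in candidates:
            if conf >= gate:  # gating: drop unlikely first-level candidates
                heapq.heappush(heap, (-conf, next(tie), rid, 1, children))

    selected = {rid: [] for rid in batch_trees}
    while heap and budget > 0:
        neg_conf, _, rid, depth, children = heapq.heappop(heap)
        path_conf = -neg_conf
        selected[rid].append((depth, path_conf))
        budget -= 1
        # Expand the chosen node; its children compete with every other
        # request's candidates for the remaining budget.
        for conf, grandchildren in children:
            child_conf = path_conf * conf
            if child_conf >= gate:
                heapq.heappush(
                    heap, (-child_conf, next(tie), rid, depth + 1, grandchildren)
                )
    return selected


if __name__ == "__main__":
    batch = {
        # Request 0: one confident chain plus a weak sibling (gated out).
        0: [(0.9, [(0.9, [(0.8, [])])]), (0.05, [])],
        # Request 1: several moderately likely first tokens, no clear winner.
        1: [(0.5, []), (0.4, []), (0.3, [])],
    }
    for rid, nodes in select_draft_nodes(batch, budget=5).items():
        print(rid, [depth for depth, _ in nodes])
```

In this toy run the confident request spends its share of the budget on depth (nodes at depths 1, 2, 3), while the uncertain request spends it on width (two nodes at depth 1), which is one simple way a shared budget can shift between depth and width. A greedy heap is only one possible policy; a real serving system would also have to respect kernel and batching constraints that this sketch ignores.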
