ECHO: Elastic Speculative Decoding with Sparse Gating for High-Concurrency Scenarios

Xinyi Hu; Yuhao Shen; Baolin Zhang; Hengxin Zhang; Jun Dai; Shuang Ge; Lei Chen; Yue Li; Mingcheng Wan

arXiv:2604.09603·cs.DC·May 15, 2026

TL;DR

ECHO is a speculative decoding framework for high-concurrency large language model serving. It combines sparse confidence gating with elastic budget scheduling and reports up to 5.35x walltime speedup over state-of-the-art methods.

Contribution

ECHO reformulates speculative execution as a budgeted scheduling problem and uses sparse confidence gating to shift a shared verification budget between tree depth and width in high-concurrency serving.

Findings

01. ECHO achieves up to 5.35x walltime speedup over state-of-the-art methods.

02. ECHO outperforms existing methods across diverse model scales, especially in industrial-grade models.

03. ECHO effectively manages the verification and execution trade-off, improving efficiency in high-load scenarios.

Abstract

Speculative Decoding promises to accelerate the inference of Large Language Models, yet its efficacy often degrades in production-grade serving. Existing evaluations typically overlook the compute-bound nature of high-concurrency regimes, where verification compute becomes the dominant bottleneck. Consequently, prior methods face a dilemma: static trees incur massive verification waste, while dynamic trees suffer from cumulative misjudgments and kernel incompatibility. To bridge this gap, we introduce ECHO, a high-concurrency-oriented framework integrated into SGLang that reformulates speculative execution as a budgeted scheduling problem. Crucially, ECHO employs sparse confidence gating to manage the batch as a unified super-tree, elastically pivoting budget between depth and width to co-optimize the trade-off between reducing global verification steps and maximizing per-step…
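The abstract is truncated here and does not spell out the scheduling algorithm, so the following is only a toy illustration of the general idea of spending one verification budget across a whole batch. It is not ECHO's implementation and uses no SGLang API. The names `DraftNode`, `select_under_budget`, `budget`, and `gate` are hypothetical, and the selection rule (greedy best-first on the product of draft probabilities along a path, with a confidence threshold) is an assumption chosen for simplicity.

```python
import heapq
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DraftNode:
    request_id: int           # which request in the batch this node belongs to
    node_id: int              # unique within its request
    parent_id: Optional[int]  # None if the parent is the last verified token
    prob: float               # draft model's conditional probability (assumed given)


def select_under_budget(nodes, budget, gate=0.0):
    """Pick at most `budget` draft nodes across the whole batch.

    Candidates from all requests compete in one pool (a stand-in for the
    "super-tree"). A node is only eligible once its parent is selected, so
    each request keeps a connected subtree. Selection stops early when the
    best remaining path confidence falls below `gate`.
    """
    children = defaultdict(list)
    for n in nodes:
        children[(n.request_id, n.parent_id)].append(n)

    # Max-heap via negated path confidence; the counter breaks ties.
    heap, counter = [], 0
    for (_, parent), kids in children.items():
        if parent is None:
            for n in kids:
                heapq.heappush(heap, (-n.prob, counter, n))
                counter += 1

    selected = defaultdict(list)
    used = 0
    while heap and used < budget:
        neg_conf, _, n = heapq.heappop(heap)
        conf = -neg_conf
        if conf < gate:
            break  # path confidence never increases with depth, so nothing better remains
        selected[n.request_id].append(n.node_id)
        used += 1
        for child in children.get((n.request_id, n.node_id), []):
            heapq.heappush(heap, (-(conf * child.prob), counter, child))
            counter += 1
    return dict(selected)


# Request 0 has a confident chain; request 1 has uncertain siblings.
batch = [
    DraftNode(0, 0, None, 0.9), DraftNode(0, 1, 0, 0.9), DraftNode(0, 2, 1, 0.9),
    DraftNode(1, 0, None, 0.4), DraftNode(1, 1, None, 0.35), DraftNode(1, 2, 0, 0.5),
]
print(select_under_budget(batch, budget=4, gate=0.3))  # {0: [0, 1, 2], 1: [0]}
print(select_under_budget(batch, budget=6, gate=0.3))  # {0: [0, 1, 2], 1: [0, 1]}
```

In this toy run the confident request receives a deep chain while the uncertain request receives width, and with the larger budget the threshold leaves one slot unspent instead of verifying a low-confidence node. How ECHO itself makes these decisions is described in the full paper, not here.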
