Double: Breaking the Acceleration Limit via Double Retrieval Speculative Parallelism

Yuhao Shen; Tianyu Liu; Junyi Shen; Jinyang Wu; Quan Kong; Li Huan; Cong Wang

arXiv:2601.05524 · cs.CL · April 15, 2026

TL;DR

Double is a training-free, retrieval-based speculative parallelism framework that exceeds the speedup ceiling of parallel speculative decoding, reaching 5.3x on LLaMA3.3-70B and 2.8x on Qwen3-32B.

Contribution

The paper proposes a training-free, lossless method that breaks the theoretical speedup ceiling of speculative decoding by combining iterative retrieval speculation on the draft model with authoritative retrieval guidance from the target model.

Findings

01. Achieves 5.3x speedup on LLaMA3.3-70B

02. Achieves 2.8x speedup on Qwen3-32B

03. Outperforms EAGLE-3 in speed without extra training

Abstract

Parallel Speculative Decoding (PSD) accelerates traditional Speculative Decoding (SD) by overlapping draft generation with verification. However, it remains hampered by two fundamental challenges: (1) a theoretical speedup ceiling dictated by the speed ratio between the draft and target models, and (2) high computational waste and pipeline stalls caused by mid-sequence token rejections that stem from early errors. To address these limitations, we introduce Double (Double Retrieval Speculative Parallelism). By bridging the gap between SD and PSD, our framework resolves the Retrieval Precision-Efficiency Dilemma through a novel synchronous mechanism. Specifically, we enable the draft model to execute iterative retrieval speculations to break the theoretical speedup limits; to alleviate rejections without rollback, the target model performs authoritative retrieval to generate multi-token…
