CARD: A Cache-Assisted Parallel Speculative Decoding Framework via Query-and-Correct Paradigm for Accelerating LLM Inference
Enyu Zhou, Kai Sheng, Hao Chen, Xin He

TL;DR
CARD introduces a cache-assisted parallel speculative decoding framework that decouples drafting and verification, enabling near-draft-speed inference and significantly accelerating large language model decoding without additional fine-tuning.
Contribution
CARD proposes a query-and-correct paradigm for speculative decoding that removes the strict draft-then-verify ordering: drafting and verification run concurrently rather than sequentially, which improves inference speed.
Findings
Achieves up to 4.83x acceleration over standard decoding.
Operates without fine-tuning of models.
Outperforms existing state-of-the-art methods.
Abstract
Speculative decoding (SD), where a draft model provides multiple candidate tokens for the target model to verify in parallel, has demonstrated significant potential for accelerating LLM inference. Yet, existing SD approaches adhere to a strict draft-then-verify paradigm, enforcing a sequential process that hampers performance and constrains the draft model's capacity. Moreover, rejecting a token in the candidate sequence invalidates all subsequent tokens, leading to wasted computation during drafting. To overcome these limitations, we propose a cache-assisted parallel speculative decoding framework called CARD, which employs a novel query-and-correct paradigm. Our approach decouples drafting from verification: the draft model populates a shared cache with candidate tokens, while the target model concurrently refines the draft's trajectory. This enables inference at near-draft-speed,…
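For context on the baseline the abstract argues against, the following is a minimal toy sketch of one conventional draft-then-verify step with greedy acceptance. It does not implement CARD's query-and-correct paradigm. The `draft_next` and `target_next` functions are hypothetical stand-ins for real models (simple deterministic rules over integer tokens), and the verification loop stands in for what would be a single parallel forward pass of the target model. The sketch only illustrates how one rejection discards every drafted token after it.

```python
from typing import Callable, List, Tuple

Token = int
NextFn = Callable[[List[Token]], Token]


def target_next(ctx: List[Token]) -> Token:
    # Hypothetical "target model": the next token is (last + 1) mod 10.
    return (ctx[-1] + 1) % 10


def draft_next(ctx: List[Token]) -> Token:
    # Hypothetical "draft model": agrees with the target except after token 4,
    # where it guesses 0 instead of 5.
    if ctx[-1] == 4:
        return 0
    return (ctx[-1] + 1) % 10


def speculative_step(
    ctx: List[Token], k: int, draft: NextFn, target: NextFn
) -> Tuple[List[Token], int]:
    """Run one draft-then-verify step.

    Returns the tokens appended to the context and the number of drafted
    tokens that were thrown away.
    """
    # Phase 1: the draft model proposes k tokens, one after another.
    drafted: List[Token] = []
    for _ in range(k):
        drafted.append(draft(ctx + drafted))

    # Phase 2: the target checks the proposals in order. A real system scores
    # all k positions in one parallel pass; this loop emulates the outcome.
    accepted: List[Token] = []
    n_match = 0
    for tok in drafted:
        expected = target(ctx + accepted)
        if tok == expected:
            accepted.append(tok)
            n_match += 1
        else:
            # First mismatch: keep the target's token and drop the rest.
            accepted.append(expected)
            break
    else:
        # Every draft token matched, so the target contributes one extra token.
        accepted.append(target(ctx + accepted))

    wasted = len(drafted) - n_match
    return accepted, wasted


if __name__ == "__main__":
    print(speculative_step([3], 4, draft_next, target_next))
    print(speculative_step([5], 3, draft_next, target_next))
```

In the first call the draft diverges at its second token, so three of the four drafted tokens are discarded. This is the wasted drafting work the abstract describes.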
Taxonomy
Topics: Algorithms and Data Compression · Natural Language Processing Techniques · Advanced Data Storage Technologies
