CARD: A Cache-Assisted Parallel Speculative Decoding Framework via Query-and-Correct Paradigm for Accelerating LLM Inference

Enyu Zhou; Kai Sheng; Hao Chen; Xin He

arXiv:2508.04462 · cs.LG · September 22, 2025

Open Access

TL;DR

CARD introduces a cache-assisted parallel speculative decoding framework that decouples drafting and verification, enabling near-draft-speed inference and significantly accelerating large language model decoding without additional fine-tuning.

Contribution

The paper proposes a query-and-correct paradigm for speculative decoding that removes the sequential dependency of the draft-then-verify approach, so that drafting and verification run concurrently and inference speed improves.

Findings

01. Achieves up to 4.83x acceleration over standard decoding.

02. Operates without fine-tuning of models.

03. Outperforms existing state-of-the-art methods.

Abstract

Speculative decoding (SD), where a draft model provides multiple candidate tokens for the target model to verify in parallel, has demonstrated significant potential for accelerating LLM inference. Yet, existing SD approaches adhere to a strict draft-then-verify paradigm, enforcing a sequential process that hampers performance and constrains the draft model's capacity. Moreover, rejecting a token in the candidate sequence invalidates all subsequent tokens, leading to wasted computation during drafting. To overcome these limitations, we propose a cache-assisted parallel speculative decoding framework called CARD, which employs a novel query-and-correct paradigm. Our approach decouples drafting from verification: the draft model populates a shared cache with candidate tokens, while the target model concurrently refines the draft's trajectory. This enables inference at near-draft-speed,…
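To make the baseline concrete, the following is a minimal sketch of the conventional draft-then-verify step that the abstract contrasts CARD against. It is not the authors' method, and it does not implement the query-and-correct paradigm or the shared cache. Everything here is an assumption made for illustration: the function name `draft_then_verify_step`, the greedy exact-match acceptance rule, and the two toy models `toy_target` and `toy_draft`, which stand in for real language models. In a real system the target model scores all candidate positions in one batched forward pass; the loop below only mimics that result.

```python
from typing import Callable, List, Tuple

Token = int
NextTokenFn = Callable[[List[Token]], Token]


def draft_then_verify_step(
    prefix: List[Token],
    draft_next: NextTokenFn,
    target_next: NextTokenFn,
    k: int,
) -> Tuple[List[Token], int]:
    """Run one draft-then-verify round (hypothetical helper, greedy acceptance).

    Returns the tokens committed this round and the number of drafted
    tokens that were thrown away.
    """
    # Draft phase: the draft model proposes k tokens one after another.
    candidates: List[Token] = []
    ctx = list(prefix)
    for _ in range(k):
        tok = draft_next(ctx)
        candidates.append(tok)
        ctx.append(tok)

    # Verify phase: compare each candidate with the target model's choice.
    committed: List[Token] = []
    ctx = list(prefix)
    for tok in candidates:
        expected = target_next(ctx)
        if tok != expected:
            # The first rejection invalidates this token and all later ones.
            discarded = len(candidates) - len(committed)
            committed.append(expected)  # keep the target's correction
            return committed, discarded
        committed.append(tok)
        ctx.append(tok)

    # Every candidate was accepted; the target contributes one extra token.
    committed.append(target_next(ctx))
    return committed, 0


def toy_target(ctx: List[Token]) -> Token:
    # Assumed stand-in for the target model: count upward modulo 10.
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[Token]) -> Token:
    # Assumed stand-in for the draft model: agrees with the target
    # except right after token 3, where it wrongly predicts 0.
    return 0 if ctx[-1] == 3 else (ctx[-1] + 1) % 10


if __name__ == "__main__":
    print(draft_then_verify_step([1], toy_draft, toy_target, k=4))
    print(draft_then_verify_step([5], toy_draft, toy_target, k=3))
```

In the first call the draft proposes four tokens, the third is rejected, and the two tokens from the rejection onward are discarded. That discarded work, together with the target model idling while the draft runs, is the cost the abstract attributes to the sequential paradigm.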

Taxonomy

Topics: Algorithms and Data Compression · Natural Language Processing Techniques · Advanced Data Storage Technologies