TL;DR
PARD-2 is a dual-mode speculative decoding framework with Confidence-Adaptive Token (CAT) optimization. It accelerates LLM inference by training the draft model to maximize acceptance length rather than per-token prediction accuracy.
Contribution
PARD-2 reformulates the draft model's training objective around acceptance length instead of token prediction accuracy, and lets a single draft model serve both target-dependent and target-independent modes.
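A simplified model shows why acceptance length and per-token accuracy can diverge. This is an illustration only, not the paper's formulation, and the symbols $K$, $p_i$, and $L$ are introduced here. Suppose the draft proposes $K$ tokens and $p_i$ is the probability that token $i$ is accepted given that all earlier tokens were accepted. Because verification stops at the first rejection, the expected acceptance length $L$ is:

```latex
\mathbb{E}[L] \;=\; \sum_{k=1}^{K} \Pr(L \ge k) \;=\; \sum_{k=1}^{K} \prod_{i=1}^{k} p_i
```

With $K = 3$, the probabilities $(0.9, 0.5, 0.9)$ give $0.9 + 0.45 + 0.405 = 1.755$, while $(0.9, 0.9, 0.5)$ give $0.9 + 0.81 + 0.405 = 2.115$. Both cases have the same average per-token accuracy, yet an error at an early position costs more accepted tokens than the same error later.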
Findings
Achieves up to 6.94× acceleration in LLM inference.
Surpasses EAGLE-3 by 1.9× and PARD by 1.3× on Llama3.1-8B.
Supports both target-dependent and target-independent modes.
Abstract
Speculative decoding accelerates Large Language Model (LLM) inference by using a lightweight draft model to propose candidate tokens that are verified in parallel by the target model. However, existing draft model training objectives are not directly aligned with the inference-time goal of maximizing consecutive token acceptance. To address this issue, we reformulate the draft model optimization objective, shifting the focus from token prediction accuracy to the overall acceptance length. In this paper, we build upon PARD to propose PARD-2, a dual-mode speculative decoding framework with Confidence-Adaptive Token (CAT) optimization. This approach adaptively reweights each token to better align with the verification process. Notably, PARD-2 enables a single draft model to support both target-dependent and target-independent modes. Experiments across diverse models and tasks demonstrate…
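The draft-then-verify loop described in the abstract can be sketched for the greedy case as follows. This is a minimal illustration under stated assumptions, not PARD-2's implementation: the function names `accepted_prefix_length` and `speculative_step` are hypothetical, tokens are plain integers, and the target model's greedy predictions at each of the K+1 verified positions are assumed to be already computed in one parallel pass.

```python
from typing import List, Sequence


def accepted_prefix_length(draft_tokens: Sequence[int],
                           target_tokens: Sequence[int]) -> int:
    """Count how many leading draft tokens match the target's greedy choices.

    Verification stops at the first mismatch, so later matches do not count.
    """
    accepted = 0
    for draft_tok, target_tok in zip(draft_tokens, target_tokens):
        if draft_tok != target_tok:
            break
        accepted += 1
    return accepted


def speculative_step(draft_tokens: Sequence[int],
                     target_tokens: Sequence[int]) -> List[int]:
    """Return the tokens committed by one draft-and-verify step (greedy case).

    Assumption: target_tokens has len(draft_tokens) + 1 entries, where entry i
    is the target's greedy token given the prefix plus the first i draft tokens.
    The step commits the accepted draft prefix, then one token from the target:
    the correction at the first mismatch, or a bonus token if all were accepted.
    """
    if len(target_tokens) != len(draft_tokens) + 1:
        raise ValueError("target_tokens must have one more entry than draft_tokens")
    n = accepted_prefix_length(draft_tokens, target_tokens)
    return list(draft_tokens[:n]) + [target_tokens[n]]


if __name__ == "__main__":
    draft = [5, 7, 9, 2]        # four proposed tokens
    target = [5, 7, 1, 4, 8]    # target's greedy token at each of the 5 positions
    print(accepted_prefix_length(draft, target))  # acceptance length for this step
    print(speculative_step(draft, target))        # tokens actually committed
```

Each step commits the accepted prefix plus one target token, so a longer acceptance length means fewer target-model passes per generated token. This is the quantity the abstract says the training objective is reformulated around.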
