Alignment-Augmented Speculative Decoding with Alignment Sampling and Conditional Verification

Jikai Wang; Zhenxu Tian; Juntao Li; Qingrong Xia; Xinyu Duan; Zhefeng Wang; Baoxing Huai; Min Zhang

arXiv:2505.13204 · cs.CL · September 15, 2025

TL;DR

This paper introduces a training-free speculative decoding method that combines alignment sampling with flexible verification to improve both the generation speed and the generation accuracy of large language models.

Contribution

It proposes a training-free strategy, built on alignment sampling and flexible verification, that improves the efficiency and accuracy of speculative decoding in large language models.

Findings

01. Increases the average generation score by 3.3 points across 8 datasets.

02. Achieves a mean acceptance length of 2.39.

03. Speeds up generation by 2.23 times.

Abstract

Recent works have revealed the great potential of speculative decoding in accelerating the autoregressive generation process of large language models. The success of these methods relies on the alignment between draft candidates and the sampled outputs of the target model. Existing methods mainly achieve draft-target alignment with training-based methods, e.g., EAGLE, Medusa, involving considerable training costs. In this paper, we present a training-free alignment-augmented speculative decoding algorithm. We propose alignment sampling, which leverages output distribution obtained in the prefilling phase to provide more aligned draft candidates. To further benefit from high-quality but non-aligned draft candidates, we also introduce a simple yet effective flexible verification strategy. Through an adaptive probability threshold, our approach can improve generation accuracy while further…
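To make the verification idea concrete, the following is a minimal sketch of threshold-based draft verification under greedy decoding. It is an illustration under stated assumptions, not the authors' implementation: the function name `verify_draft`, the list-based inputs, and the fixed `threshold` argument are hypothetical, and the paper describes an adaptive threshold whose schedule is not given in this summary. Strict verification accepts a draft token only when it matches the target model's top choice; the flexible variant sketched here also accepts a draft token whose target probability is at least `threshold`.

```python
from typing import List, Sequence


def verify_draft(
    draft_tokens: Sequence[int],
    target_probs: Sequence[Sequence[float]],
    threshold: float,
) -> List[int]:
    """Return the tokens emitted by one draft-then-verify step.

    target_probs[i] is the target model's distribution at draft position i,
    computed in a single forward pass over the draft. It must contain
    len(draft_tokens) + 1 rows; the last row supplies the bonus token that
    is emitted when every draft token is accepted.

    Assumption: `threshold` is a fixed value here. Passing float("inf")
    reduces the rule to strict greedy verification.
    """
    emitted: List[int] = []
    for i, token in enumerate(draft_tokens):
        probs = target_probs[i]
        top = max(range(len(probs)), key=probs.__getitem__)
        if token == top or probs[token] >= threshold:
            # Draft token matches the target or is probable enough: accept.
            emitted.append(token)
        else:
            # Reject: emit the target's own choice and stop verifying.
            emitted.append(top)
            return emitted
    # Every draft token was accepted, so emit one extra target token.
    last = target_probs[len(draft_tokens)]
    emitted.append(max(range(len(last)), key=last.__getitem__))
    return emitted


# Toy example with a 3-token vocabulary (values are made up for illustration).
draft = [2, 1, 0]
probs = [
    [0.1, 0.2, 0.7],
    [0.5, 0.4, 0.1],
    [0.2, 0.7, 0.1],
    [0.6, 0.3, 0.1],
]
strict = verify_draft(draft, probs, threshold=float("inf"))
flexible = verify_draft(draft, probs, threshold=0.3)
```

In this toy case strict verification stops at the second draft token, while the flexible rule accepts it because its target probability (0.4) clears the threshold, so one more token is emitted per step. The number of tokens emitted per verification step is what the acceptance-length figure in the findings measures.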


Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Computational and Text Analysis Methods
