Reward-Shifted Speculative Sampling Is An Efficient Test-Time Weak-to-Strong Aligner

Bolian Li; Yanran Wu; Xinyu Luo; Ruqi Zhang

arXiv:2508.15044·cs.CL·September 24, 2025

TL;DR

This paper introduces reward-shifted speculative sampling, a method that efficiently aligns large language models with human preferences during inference by leveraging a draft model, reducing costs while maintaining high reward scores.

Contribution

The paper proposes a novel reward-shifted speculative sampling algorithm that improves test-time alignment efficiency without sacrificing alignment quality.

Findings

01 Achieves higher reward scores with lower inference costs.

02 Effectively exploits distributional shifts between draft and target models.

03 Validates efficiency and effectiveness through experiments.

Abstract

Aligning large language models (LLMs) with human preferences has become a critical step in their development. Recent research has increasingly focused on test-time alignment, where additional compute is allocated during inference to enhance LLM safety and reasoning capabilities. However, these test-time alignment techniques often incur substantial inference costs, limiting their practical application. Inspired by speculative sampling, an acceleration technique that uses a small draft model to efficiently predict future tokens, we address the efficiency bottleneck of test-time alignment. We introduce the reward-shifted speculative sampling (SSS) algorithm, in which the draft model is aligned with human preferences, while the target model remains unchanged. We theoretically demonstrate that the distributional shift between the aligned draft model and the unaligned target model can be…
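For readers unfamiliar with the speculative sampling that the abstract builds on, the following is a minimal sketch of one accept/reject step of standard speculative sampling. It is not the paper's reward-shifted variant, whose modified rule is not given in the text above. The function name `speculative_step` and the toy distributions are hypothetical, chosen only for illustration, and both models are assumed to share one vocabulary.

```python
import random


def speculative_step(p_target, q_draft, rng):
    """One token of standard (unshifted) speculative sampling.

    p_target, q_draft: probability lists over the same vocabulary
    (assumed inputs; in practice they come from the two models).
    Returns (token, accepted), where accepted is True when the
    draft model's proposal was kept.
    """
    vocab = list(range(len(q_draft)))

    # 1. The cheap draft model proposes a token.
    x = rng.choices(vocab, weights=q_draft, k=1)[0]

    # 2. The target model accepts it with probability min(1, p(x) / q(x)).
    #    q_draft[x] > 0 here because x was sampled from q_draft.
    if rng.random() < min(1.0, p_target[x] / q_draft[x]):
        return x, True

    # 3. On rejection, resample from the residual max(0, p - q).
    #    random.choices normalizes the weights, so the returned token
    #    is distributed exactly as p_target overall.
    residual = [max(0.0, p - q) for p, q in zip(p_target, q_draft)]
    y = rng.choices(vocab, weights=residual, k=1)[0]
    return y, False
```

With toy distributions such as `p_target = [0.6, 0.3, 0.1]` and `q_draft = [0.2, 0.5, 0.3]`, repeated calls produce tokens whose empirical frequencies approach `p_target`, which is the property that lets the standard method speed up decoding without changing the target model's output distribution.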

Taxonomy

Topics: Topic Modeling · Mobile Crowdsensing and Crowdsourcing · Explainable Artificial Intelligence (XAI)