Mining Intrinsic Rewards from LLM Hidden States for Efficient Best-of-N Sampling

Jizhou Guo; Zhaomin Wu; Hanchen Yang; Philip S. Yu

arXiv:2505.12225 · cs.LG · January 9, 2026

TL;DR

SWIFT introduces a lightweight method to learn reward functions directly from LLM hidden states, significantly improving efficiency and performance in best-of-N sampling without relying on large, text-based reward models.

Contribution

The paper presents SWIFT, a novel approach that leverages LLM hidden states to learn reward functions, reducing computational costs and data requirements compared to traditional reward models.

Findings

01 SWIFT achieves 12.7% higher accuracy than EurusRM-7B on the MATH dataset.

02 SWIFT uses less than 0.005% of the parameters of existing reward models.

03 SWIFT scales with the number of samples and is compatible with certain closed-source models.

Abstract

Best-of-N sampling is a powerful method for improving Large Language Model (LLM) performance, but it is often limited by its dependence on massive, text-based reward models. These models are not only computationally expensive but also data-hungry, requiring extensive labeled datasets for training. This creates a significant data challenge, as they overlook a rich, readily available data source: the LLM's own internal hidden states. To address this data and efficiency gap, we introduce SWIFT (Simple Weighted Intrinsic Feedback Technique), a novel and lightweight method that learns a reward function directly from the rich information embedded in LLM hidden states. Operating at the token embedding level, SWIFT employs simple linear layers to effectively distinguish between preferred and dispreferred generations, eliminating the need for computationally intensive text-based modeling.…
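The idea in the abstract can be illustrated with a minimal sketch: score each candidate generation from its per-token hidden states using linear maps, then keep the highest-scoring candidate. This is not the authors' implementation. The function names, the sigmoid gate, and the gate-weighted average are assumptions made for illustration, and the weights are hand-set rather than trained.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def sequence_reward(hidden_states: Sequence[Vector],
                    w_reward: Vector, b_reward: float,
                    w_gate: Vector, b_gate: float) -> float:
    """Scalar reward for one generation.

    hidden_states holds one vector per generated token. Each token gets a
    linear reward and a sigmoid gate; the sequence reward is the
    gate-weighted average (an assumed aggregation, for illustration only).
    """
    if not hidden_states:
        raise ValueError("need at least one token's hidden state")
    weighted_sum = 0.0
    weight_total = 0.0
    for h in hidden_states:
        token_reward = dot(w_reward, h) + b_reward
        gate = sigmoid(dot(w_gate, h) + b_gate)
        weighted_sum += gate * token_reward
        weight_total += gate
    return weighted_sum / weight_total


def best_of_n(candidates: Sequence[Sequence[Vector]],
              w_reward: Vector, b_reward: float,
              w_gate: Vector, b_gate: float) -> int:
    """Index of the candidate with the highest reward."""
    rewards: List[float] = [
        sequence_reward(c, w_reward, b_reward, w_gate, b_gate)
        for c in candidates
    ]
    return max(range(len(rewards)), key=rewards.__getitem__)


# Toy usage: three candidates with 2-dimensional "hidden states".
toy_candidates = [
    [[1.0, 0.0], [3.0, 0.0]],
    [[0.0, 5.0], [1.0, 5.0]],
    [[4.0, 0.0]],
]
chosen = best_of_n(toy_candidates, [1.0, 0.0], 0.0, [0.0, 0.0], 0.0)
print("selected candidate index:", chosen)
```

In a real setting the hidden states would come from the generating LLM's forward pass, and the linear weights would be fit on labeled preferred and dispreferred generations; both steps are outside the scope of this sketch.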

Code & Models

Models

Aster2024/swift-ministral-8b-deepscaler
model · 2 downloads · 2 likes

Datasets

Aster2024/swift-reasoning-rollouts-deepscaler-ministral8b
dataset · 19 downloads
