Mining Intrinsic Rewards from LLM Hidden States for Efficient Best-of-N Sampling
Jizhou Guo, Zhaomin Wu, Hanchen Yang, Philip S. Yu

TL;DR
SWIFT introduces a lightweight method to learn reward functions directly from LLM hidden states, significantly improving efficiency and performance in best-of-N sampling without relying on large, text-based reward models.
Contribution
The paper presents SWIFT, a novel approach that leverages LLM hidden states to learn reward functions, reducing computational costs and data requirements compared to traditional reward models.
Findings
SWIFT achieves 12.7% higher accuracy than EurusRM-7B on the MATH dataset.
Uses less than 0.005% of the parameters of existing reward models.
Demonstrates scalability and compatibility with certain closed-source models.
Abstract
Best-of-N sampling is a powerful method for improving Large Language Model (LLM) performance, but it is often limited by its dependence on massive, text-based reward models. These models are not only computationally expensive but also data-hungry, requiring extensive labeled datasets for training. This creates a significant data challenge, as they overlook a rich, readily available data source: the LLM's own internal hidden states. To address this data and efficiency gap, we introduce SWIFT (Simple Weighted Intrinsic Feedback Technique), a novel and lightweight method that learns a reward function directly from the rich information embedded in LLM hidden states. Operating at the token embedding level, SWIFT employs simple linear layers to effectively distinguish between preferred and dispreferred generations, eliminating the need for computationally intensive text-based modeling.…
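Sketch: best-of-N selection with a linear reward on hidden states
The abstract describes scoring generations with simple linear layers applied to token-level hidden states and then picking the best of N candidates. The following is a minimal illustration of that idea, not the authors' implementation. The two-head design (one linear head for a per-token reward, one for a per-token weight), the sigmoid gate, the weighted-mean aggregation, and all function names are assumptions made for illustration. The weights here are set by hand; in practice they would presumably be trained to separate preferred from dispreferred generations, which this sketch omits.

```python
import math

def token_scores(hidden_states, w_reward, b_reward, w_gate, b_gate):
    """Apply two hypothetical linear heads to each token's hidden state.

    Returns a list of (reward, gate) pairs, one per token.
    """
    pairs = []
    for h in hidden_states:
        reward = sum(hi * wi for hi, wi in zip(h, w_reward)) + b_reward
        gate_logit = sum(hi * wi for hi, wi in zip(h, w_gate)) + b_gate
        gate = 1.0 / (1.0 + math.exp(-gate_logit))  # sigmoid: weight in (0, 1)
        pairs.append((reward, gate))
    return pairs

def sequence_reward(hidden_states, params):
    """Gate-weighted mean of per-token rewards (assumed aggregation rule).

    Assumes at least one token, so the total weight is positive.
    """
    pairs = token_scores(hidden_states, *params)
    total_weight = sum(g for _, g in pairs)
    return sum(r * g for r, g in pairs) / total_weight

def best_of_n(candidates, params):
    """candidates: list of (text, hidden_states). Return the top-scoring text."""
    return max(candidates, key=lambda c: sequence_reward(c[1], params))[0]

# Toy example with 2-dimensional hidden states and hand-set weights:
# the reward head reads the first coordinate, the gate is uniform (0.5).
params = ([1.0, 0.0], 0.0, [0.0, 0.0], 0.0)
candidates = [
    ("A", [[1.0, 0.0], [3.0, 0.0]]),
    ("B", [[0.0, 1.0], [1.0, 1.0]]),
]
print(best_of_n(candidates, params))
```

With a real model, the hidden states would come from the generating LLM's own forward pass, so scoring adds only the cost of the linear heads rather than a second pass through a separate reward model.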
