TL;DR
This paper introduces SARM, a novel reward model architecture that uses a sparse autoencoder to provide interpretable, feature-level attributions and better adaptability for aligning large language models with human preferences.
Contribution
The paper proposes SARM, which integrates a pretrained sparse autoencoder into a reward model to improve interpretability, enable feature-level attribution of reward scores, and allow the model to adapt when preferences shift.
Findings
SARM enables direct feature-level attribution of reward scores.
SARM achieves superior alignment performance over traditional reward models.
SARM allows dynamic adjustment to changing human preferences.
Abstract
Large language models (LLMs) have been widely deployed across numerous fields. Reinforcement Learning from Human Feedback (RLHF) leverages reward models (RMs) as proxies for human preferences to align LLM behaviors with human values, making the accuracy, reliability, and interpretability of RMs critical for effective alignment. However, traditional RMs lack interpretability, offer limited insight into the reasoning behind reward assignments, and are inflexible toward user preference shifts. While recent multidimensional RMs aim for improved interpretability, they often fail to provide feature-level attribution and require costly annotations. To overcome these limitations, we introduce the Sparse Autoencoder-enhanced Reward Model (SARM), a novel architecture that integrates a pretrained Sparse Autoencoder (SAE) into a reward model. SARM maps the hidden activations of LLM-based RM into an…
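The abstract is cut off before it describes how the reward is computed, so the following is only a minimal sketch of the general idea of putting a sparse autoencoder in front of a reward head. It is not the authors' implementation. The top-k sparsity rule, the linear scalar head, and every name and number here (sae_encode, reward_with_attribution, the toy weights) are assumptions made for illustration. The sketch shows why a reward that is a weighted sum of sparse features can be attributed feature by feature, and how editing a single head weight changes the score.

```python
from typing import List, Tuple


def sae_encode(hidden: List[float],
               enc_weights: List[List[float]],
               enc_bias: List[float],
               top_k: int) -> List[float]:
    """Map a dense hidden activation to a sparse feature vector.

    Assumption: ReLU followed by top-k selection; the paper's actual
    SAE variant is not stated in the truncated abstract.
    """
    pre = []
    for row, b in zip(enc_weights, enc_bias):
        z = sum(w * h for w, h in zip(row, hidden)) + b
        pre.append(z if z > 0.0 else 0.0)  # ReLU
    # Keep only the top_k largest activations, zero out the rest.
    order = sorted(range(len(pre)), key=lambda i: pre[i], reverse=True)
    keep = set(order[:top_k])
    return [v if i in keep else 0.0 for i, v in enumerate(pre)]


def reward_with_attribution(features: List[float],
                            head_weights: List[float]) -> Tuple[float, List[float]]:
    """Scalar reward as a weighted sum of features (hypothetical linear head).

    Each term w_i * f_i is that feature's contribution to the reward.
    """
    contributions = [w * f for w, f in zip(head_weights, features)]
    return sum(contributions), contributions


# Toy values, chosen only to make the arithmetic easy to follow.
hidden = [2.0, 1.0]
enc_weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [1.0, 1.0]]
enc_bias = [0.0, 0.0, 0.0, 0.0]
head_weights = [0.5, 1.0, 1.0, -1.0]

features = sae_encode(hidden, enc_weights, enc_bias, top_k=2)
reward, contributions = reward_with_attribution(features, head_weights)

# Simulated preference shift: stop penalizing feature 3 by zeroing its weight.
adjusted_weights = list(head_weights)
adjusted_weights[3] = 0.0
adjusted_reward, _ = reward_with_attribution(features, adjusted_weights)

print(features, reward, contributions, adjusted_reward)
```

In this toy setup only two features are active, so the reward is explained entirely by their two contributions, and changing one head weight shifts the score without retraining anything else. Whether SARM adjusts preferences in exactly this way is not confirmed by the text above.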
