Interpretable Reward Model via Sparse Autoencoder

Shuyi Zhang; Wei Shi; Sihang Li; Jiayi Liao; Hengxing Cai; Xiang Wang

arXiv:2508.08746 · cs.LG · November 26, 2025

TL;DR

This paper introduces SARM, a novel reward model architecture that uses a sparse autoencoder to provide interpretable, feature-level attributions and better adaptability for aligning large language models with human preferences.

Contribution

The paper proposes SARM, which integrates a pretrained sparse autoencoder into a reward model to improve interpretability, enable feature-level attribution, and make the model easier to adapt when preferences shift.

Findings

01. SARM enables direct feature-level attribution of reward scores.

02. SARM achieves superior alignment performance over traditional reward models.

03. SARM allows dynamic adjustment to changing human preferences.

Abstract

Large language models (LLMs) have been widely deployed across numerous fields. Reinforcement Learning from Human Feedback (RLHF) leverages reward models (RMs) as proxies for human preferences to align LLM behaviors with human values, making the accuracy, reliability, and interpretability of RMs critical for effective alignment. However, traditional RMs lack interpretability, offer limited insight into the reasoning behind reward assignments, and are inflexible toward user preference shifts. While recent multidimensional RMs aim for improved interpretability, they often fail to provide feature-level attribution and require costly annotations. To overcome these limitations, we introduce the Sparse Autoencoder-enhanced Reward Model (SARM), a novel architecture that integrates a pretrained Sparse Autoencoder (SAE) into a reward model. SARM maps the hidden activations of an LLM-based RM into an…
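
The idea of routing a reward through SAE features can be illustrated with a toy sketch. Everything here is an assumption made for illustration, not the paper's implementation: the ReLU encoder, the linear scalar head, the function names, and all weight values are hypothetical. The sketch shows why a linear head over sparse features would give a per-feature attribution, and how editing one head weight might stand in for a preference shift.

```python
# Hypothetical toy sketch: SAE encoder followed by a linear scalar head.
# The ReLU encoder, the linear head, and every number below are
# illustrative assumptions, not values or code from the paper.

def relu(x):
    return x if x > 0.0 else 0.0

def sae_encode(hidden, enc_weights, enc_bias):
    """Map a dense hidden vector to non-negative feature activations."""
    return [
        relu(sum(w * h for w, h in zip(row, hidden)) + b)
        for row, b in zip(enc_weights, enc_bias)
    ]

def reward_with_attribution(features, head_weights):
    """Linear head: the reward is the sum of per-feature contributions."""
    contributions = [w * f for w, f in zip(head_weights, features)]
    return sum(contributions), contributions

# Toy hidden activation and toy encoder (4 features over 3 hidden dims).
hidden = [1.0, -2.0, 0.5]
enc_weights = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 2.0],
    [1.0, 1.0, 0.0],
]
enc_bias = [0.0, 0.0, 0.0, 0.0]
head_weights = [0.5, 1.0, -1.0, 2.0]

features = sae_encode(hidden, enc_weights, enc_bias)
reward, contributions = reward_with_attribution(features, head_weights)

# Assumed form of a preference adjustment: zero out the head weight of
# one feature so it no longer affects the reward.
adjusted_weights = list(head_weights)
adjusted_weights[2] = 0.0
adjusted_reward, _ = reward_with_attribution(features, adjusted_weights)

print(features, contributions, reward, adjusted_reward)
```

In this toy setup only two of the four features are active, so the reward decomposes into two nonzero contributions that can be read off directly. Changing a single head weight changes the reward without retraining anything else, which is one plausible reading of the adaptability claim.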
