SparseRM: A Lightweight Preference Modeling with Sparse Autoencoder

Dengcan Liu; Jiahao Li; Zheren Fu; Yi Tu; Jiajun Li; Zhendong Mao; Yongdong Zhang

arXiv:2511.07896·cs.AI·November 12, 2025

TL;DR

SparseRM introduces a lightweight, interpretable reward model using sparse autoencoders to efficiently capture preference features in LLM representations, reducing training costs while maintaining high performance.

Contribution

The paper presents SparseRM, a novel approach that leverages sparse autoencoders to build efficient, interpretable reward models with minimal parameters for preference modeling in LLMs.

Findings

01. SparseRM outperforms most mainstream reward models.

02. Uses less than 1% of trainable parameters.

03. Easily integrates into downstream alignment pipelines.

Abstract

Reward models (RMs) are a core component in the post-training of large language models (LLMs), serving as proxies for human preference evaluation and guiding model alignment. However, training reliable RMs under limited resources remains challenging because it relies on large-scale preference annotations and incurs the high cost of fine-tuning LLMs. To address this, we propose SparseRM, which leverages a Sparse Autoencoder (SAE) to extract preference-relevant information encoded in model representations, enabling the construction of a lightweight and interpretable reward model. SparseRM first employs the SAE to decompose LLM representations into interpretable directions that capture preference-relevant features. The representations are then projected onto these directions to compute alignment scores, which quantify the strength of each preference feature in the representations. A simple reward…
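The projection step described in the abstract can be illustrated with a minimal sketch. It assumes that each preference-relevant SAE feature is represented by a direction vector in the LLM's hidden space and that an alignment score is the dot product of a representation with the unit-normalized direction. The function name `alignment_scores`, the normalization choice, and the toy numbers are all hypothetical and are not taken from the paper; the step that turns scores into a reward is not shown because the abstract is cut off before describing it.

```python
import numpy as np


def alignment_scores(representation, directions):
    """Project one hidden representation onto a set of feature directions.

    representation: array of shape (d,), an LLM hidden state (assumed).
    directions: array of shape (k, d), one row per feature direction (assumed).
    Returns an array of shape (k,), one score per direction.
    """
    # Normalize each direction to unit length so that a score measures how
    # strongly the representation points along that direction.
    norms = np.linalg.norm(directions, axis=1, keepdims=True)
    unit_directions = directions / norms
    return unit_directions @ representation


# Toy data: two hypothetical feature directions in a 3-dimensional space.
directions = np.array([[2.0, 0.0, 0.0],
                       [0.0, 3.0, 4.0]])
representation = np.array([1.5, 3.0, 4.0])

scores = alignment_scores(representation, directions)
print(scores)
```

In this toy case the first score depends only on the first coordinate of the representation, while the second reflects how much of the representation lies along the (0, 3, 4) direction.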

Taxonomy

Topics: Recommender Systems and Techniques · Sentiment Analysis and Opinion Mining · Topic Modeling