SteerRM: Debiasing Reward Models via Sparse Autoencoders

Mengyuan Sun; Zhuohao Yu; Weizheng Gu; Shikun Zhang; Wei Ye

arXiv:2603.12795 · cs.CL · March 16, 2026

TL;DR

SteerRM is a training-free, SAE-based method that debiases reward models by suppressing stylistic-bias features at inference time, improving Hard-split accuracy across multiple architectures without retraining.

Contribution

This paper presents SteerRM, a novel approach that uses Sparse Autoencoders for bias mitigation in reward models without retraining or architectural changes.

Findings

01. Improves Hard-split accuracy by 7.3 points on average across six RMs.

02. Effectively generalizes across different bias types and architectures.

03. Identifies shared bias encoding patterns in shallow layers.

Abstract

Reward models (RMs) are critical components of alignment pipelines, yet they exhibit biases toward superficial stylistic cues, preferring better-presented responses over semantically superior ones. Existing debiasing methods typically require retraining or architectural modifications, while direct activation suppression degrades performance due to representation entanglement. We propose SteerRM, the first training-free method for debiasing reward models using Sparse Autoencoder (SAE)-based interventions. SteerRM isolates stylistic effects using contrastive paired responses, identifies bias-related SAE features with a strength-stability criterion, and suppresses them at inference time. Across six reward models on RM-Bench, SteerRM improves Hard-split accuracy by 7.3 points on average while preserving overall performance. Results on a Gemma-based reward model and a controlled non-format…
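The three steps named in the abstract (contrast paired responses, select bias-related SAE features, suppress them at inference) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The abstract does not define the strength-stability criterion, so the definitions here are assumptions: strength is taken as the mean activation difference between the styled and plain response in each pair, and stability as the fraction of pairs in which that difference is positive. The function names, the ReLU encoder form, the thresholds, and the choice to subtract only the selected features' decoder contribution are likewise hypothetical.

```python
import numpy as np


def sae_encode(h, W_enc, b_enc):
    """ReLU SAE encoder (assumed form): hidden states -> sparse feature activations."""
    return np.maximum(h @ W_enc + b_enc, 0.0)


def select_bias_features(f_styled, f_plain, top_k=2, min_stability=0.9):
    """Pick features that fire more on styled responses, strongly and consistently.

    f_styled, f_plain: (n_pairs, n_features) SAE activations for responses that
    share content but differ in style. Both scores below are assumed definitions,
    not the paper's.
    """
    diff = f_styled - f_plain
    strength = diff.mean(axis=0)           # average style-induced activation gap
    stability = (diff > 0).mean(axis=0)    # share of pairs with a positive gap
    candidates = np.where((stability >= min_stability) & (strength > 0))[0]
    order = np.argsort(-strength[candidates])
    return candidates[order][:top_k]


def suppress(h, W_enc, b_enc, W_dec, feature_ids):
    """Subtract the selected features' decoder contribution from the hidden state.

    Only the chosen features are removed; everything else in h, including the
    SAE's reconstruction error, is left untouched.
    """
    f = sae_encode(h, W_enc, b_enc)
    return h - f[..., feature_ids] @ W_dec[feature_ids]
```

In use, `f_styled` and `f_plain` would come from running the reward model on contrastive pairs and encoding one layer's hidden states; `suppress` would then be applied to that same layer during scoring. Subtracting only the selected features, rather than replacing the hidden state with the full SAE reconstruction, is one way to limit collateral damage to unrelated information.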

Taxonomy

Topics: Topic Modeling · Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning