SteerRM: Debiasing Reward Models via Sparse Autoencoders
Mengyuan Sun, Zhuohao Yu, Weizheng Gu, Shikun Zhang, Wei Ye

TL;DR
SteerRM is a training-free, SAE-based method that debiases reward models by suppressing stylistic-bias features at inference time, improving Hard-split accuracy across multiple architectures without retraining.
Contribution
This paper presents SteerRM, a novel approach that uses Sparse Autoencoders for bias mitigation in reward models without retraining or architectural changes.
Findings
Improves Hard-split accuracy on RM-Bench by 7.3 points on average across six RMs while preserving overall performance.
Effectively generalizes across different bias types and architectures.
Identifies shared bias encoding patterns in shallow layers.
Abstract
Reward models (RMs) are critical components of alignment pipelines, yet they exhibit biases toward superficial stylistic cues, preferring better-presented responses over semantically superior ones. Existing debiasing methods typically require retraining or architectural modifications, while direct activation suppression degrades performance due to representation entanglement. We propose SteerRM, the first training-free method for debiasing reward models using Sparse Autoencoder (SAE)-based interventions. SteerRM isolates stylistic effects using contrastive paired responses, identifies bias-related SAE features with a strength-stability criterion, and suppresses them at inference time. Across six reward models on RM-Bench, SteerRM improves Hard-split accuracy by 7.3 points on average while preserving overall performance. Results on a Gemma-based reward model and a controlled non-format…
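The pipeline described in the abstract (contrast paired responses, select bias-related SAE features, suppress them at inference) can be illustrated with a minimal sketch. This is not the authors' implementation. The abstract does not define the strength-stability criterion, so the definitions below are assumptions: "strength" is taken as the mean activation gap between the styled and plain response in each pair, and "stability" as the fraction of pairs in which that gap is positive. The thresholds, the function names, and the toy identity SAE are all hypothetical. Keeping the SAE reconstruction residual untouched is likewise an assumed design choice, not something this page confirms.

```python
from statistics import mean

def select_bias_features(styled_feats, plain_feats,
                         min_strength=0.5, min_stability=0.9):
    """Return indices of SAE features that look style-driven.

    styled_feats / plain_feats: one SAE feature vector (list of floats)
    per contrastive pair, same content with and without the stylistic cue.
    Assumed criterion: large mean gap (strength) that is positive in
    most pairs (stability). Thresholds are hypothetical.
    """
    n_features = len(styled_feats[0])
    selected = []
    for j in range(n_features):
        gaps = [s[j] - p[j] for s, p in zip(styled_feats, plain_feats)]
        strength = mean(gaps)
        stability = mean([1.0 if g > 0 else 0.0 for g in gaps])
        if strength >= min_strength and stability >= min_stability:
            selected.append(j)
    return selected

def suppress(activation, bias_idx, encode, decode):
    """Zero the selected SAE features and map back to activation space.

    The part of the activation that the SAE fails to reconstruct is
    added back unchanged, so only the selected features are edited.
    """
    feats = encode(activation)
    residual = [a - r for a, r in zip(activation, decode(feats))]
    edited = [0.0 if j in bias_idx else f for j, f in enumerate(feats)]
    return [r + e for r, e in zip(decode(edited), residual)]

# Toy stand-in for a trained SAE: ReLU encoder, identity decoder.
def toy_encode(x):
    return [max(v, 0.0) for v in x]

def toy_decode(f):
    return list(f)

# Three contrastive pairs; only feature 1 reacts to the styling.
plain = [[1.0, 0.2, 0.5], [0.3, 0.1, 0.9], [0.7, 0.4, 0.2]]
styled = [[p[0], p[1] + 2.0, p[2]] for p in plain]

bias_idx = select_bias_features(styled, plain)
print(bias_idx)  # -> [1]
print(suppress([1.0, 3.0, 0.5], bias_idx, toy_encode, toy_decode))  # -> [1.0, 0.0, 0.5]
```

In a real setting, `encode` and `decode` would be a trained SAE attached to one layer of the reward model, and the edited activation would be written back into the forward pass before the reward head is evaluated.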
Taxonomy
Topics: Topic Modeling · Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning
