Preference Instability in Reward Models: Detection and Mitigation via Sparse Autoencoders

Shunchang Liu; Xin Chen; Belen Martin Urcelay; Francesco Croce

arXiv:2605.16339·cs.LG·May 19, 2026

TL;DR

This paper identifies preference instability in reward models for language models, analyzes its causes at the representation level, and proposes methods based on Sparse Autoencoders (SAEs) that detect and mitigate the instability without retraining the reward model.

Contribution

The paper analyzes preference instability at the representation level, isolates the unstable features responsible using Sparse Autoencoders (SAEs), and develops two SAE-based strategies to improve preference consistency.

Findings

01 Substantially reduces incorrect preferences on benchmarks

02 Preserves performance on benign tasks

03 Does not require retraining the reward model

Abstract

Preference learning in large language models relies on reward models as proxies for human judgment. However, these models frequently exhibit preference instability, producing contradictory preference assignments in response to subtle, meaning-preserving input variations. We analyze this instability at the representation level under three semantic-preserving perturbation types: paraphrasing, pattern injection, and backdoor triggers. We attribute this instability to over-reliance on predictive yet brittle features, which we term unstable features, and isolate them via Sparse Autoencoders (SAEs) in a sparse latent space where benign and perturbed inputs activate distinctly separable patterns. Building on this separability, we propose two SAE-based instability mitigation strategies: SAE Feature Steering, which identifies and suppresses anomalously activated features at inference, and SAE…
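
To make the idea of suppressing anomalously activated SAE features more concrete, here is a minimal sketch in Python with NumPy. It is not the authors' implementation. Everything specific in it is an assumption made for illustration: the toy tied-weight SAE with random weights, the helper names (`sae_encode`, `sae_decode`, `fit_benign_stats`, `steer`), the per-feature z-score test against benign statistics, and the choice to reset flagged features to their benign mean. The abstract does not say how anomalous features are identified, so the z-score rule stands in for whatever criterion the paper uses.

```python
import numpy as np

def sae_encode(h, W_enc, b_enc):
    """ReLU encoder: hidden states (n, d) -> sparse latents (n, m)."""
    return np.maximum(h @ W_enc + b_enc, 0.0)

def sae_decode(z, W_dec, b_dec):
    """Linear decoder: latents (n, m) -> reconstructed hidden states (n, d)."""
    return z @ W_dec + b_dec

def fit_benign_stats(h_benign, W_enc, b_enc):
    """Per-feature mean and std of SAE activations on benign inputs."""
    z = sae_encode(h_benign, W_enc, b_enc)
    return z.mean(axis=0), z.std(axis=0) + 1e-6  # epsilon avoids divide-by-zero

def steer(h, W_enc, b_enc, W_dec, b_dec, mu, sigma, z_thresh=4.0):
    """Flag features far above their benign range and reset them to the benign mean.

    Assumption: "anomalous" means a z-score above z_thresh. The SAE
    reconstruction error is added back so that unflagged content in h
    is left untouched.
    """
    z = sae_encode(h, W_enc, b_enc)
    anomalous = (z - mu) / sigma > z_thresh
    z_steered = np.where(anomalous, mu, z)
    residual = h - sae_decode(z, W_dec, b_dec)
    return sae_decode(z_steered, W_dec, b_dec) + residual, anomalous

# Toy setup (hypothetical): random unit-norm dictionary with tied weights.
rng = np.random.default_rng(0)
d, m = 16, 64
W_enc = rng.normal(size=(d, m))
W_enc /= np.linalg.norm(W_enc, axis=0, keepdims=True)
W_dec = W_enc.T.copy()
b_enc, b_dec = np.zeros(m), np.zeros(d)

# Benign statistics from stand-in "clean" hidden states.
mu, sigma = fit_benign_stats(rng.normal(size=(512, d)), W_enc, b_enc)

# Simulate a perturbation that pushes the hidden state along feature 7.
h_perturbed = rng.normal(size=(1, d)) + 20.0 * W_dec[7]
h_steered, flagged = steer(h_perturbed, W_enc, b_enc, W_dec, b_dec, mu, sigma)

print("flagged feature indices:", np.flatnonzero(flagged[0]))
print("feature 7 before:", sae_encode(h_perturbed, W_enc, b_enc)[0, 7])
print("feature 7 after: ", sae_encode(h_steered, W_enc, b_enc)[0, 7])
```

In a real pipeline the steered hidden state would be passed on to the remaining layers of the reward model. Adding the reconstruction residual back is a design choice in this sketch: it means an input with no flagged features passes through unchanged, which fits the stated goal of preserving behavior on benign inputs.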

Code & Models

Repositories

shunchang-liu/pisa (GitHub)
