TL;DR
This paper identifies preference instability in reward models for language models, analyzes its causes at the representation level, and proposes Sparse Autoencoder (SAE)-based methods that detect and mitigate the instability without retraining the reward model.
Contribution
It introduces a novel analysis of preference instability, isolates unstable features with Sparse Autoencoders, and develops two SAE-based strategies to improve preference consistency.
Findings
The SAE-based strategies substantially reduce incorrect preferences on benchmarks.
They preserve performance on benign tasks.
They do not require retraining the reward model.
Abstract
Preference learning in large language models relies on reward models as proxies for human judgment. However, these models frequently exhibit preference instability, producing contradictory preference assignments in response to subtle, meaning-preserving input variations. We analyze this instability at the representation level under three semantic-preserving perturbation types: paraphrasing, pattern injection, and backdoor triggers. We attribute this instability to over-reliance on predictive yet brittle features, which we term unstable features, and isolate them via Sparse Autoencoders (SAEs) in a sparse latent space where benign and perturbed inputs activate distinctly separable patterns. Building on this separability, we propose two SAE-based instability mitigation strategies: SAE Feature Steering, which identifies and suppresses anomalously activated features at inference, and SAE…
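To make the idea of suppressing anomalously activated features more concrete, here is a minimal sketch under stated assumptions; it is not the authors' implementation. It assumes a ReLU SAE with hypothetical weights `W_enc`, `b_enc`, `W_dec`, `b_dec`, flags a feature as anomalous when its activation exceeds a z-score threshold relative to benign inputs, and resets flagged features to their benign mean. The threshold `z_thresh`, the choice of replacement value, and the helper names are all illustrative assumptions, and the random weights stand in for a trained SAE.

```python
import numpy as np

def sae_encode(h, W_enc, b_enc):
    """ReLU encoder: map a hidden state to sparse latent features."""
    return np.maximum(h @ W_enc + b_enc, 0.0)

def sae_decode(z, W_dec, b_dec):
    """Linear decoder: map latent features back to the hidden-state space."""
    return z @ W_dec + b_dec

def fit_benign_stats(H_benign, W_enc, b_enc):
    """Per-feature mean and std of SAE activations on benign inputs."""
    Z = sae_encode(H_benign, W_enc, b_enc)
    return Z.mean(axis=0), Z.std(axis=0) + 1e-6  # small floor avoids divide-by-zero

def steer(h, W_enc, b_enc, W_dec, b_dec, mu, sigma, z_thresh=4.0):
    """Reset anomalously high features to their benign mean (assumed rule)."""
    z = sae_encode(h, W_enc, b_enc)
    # Keep whatever the SAE fails to reconstruct, so only flagged features change.
    residual = h - sae_decode(z, W_dec, b_dec)
    flagged = (z - mu) / sigma > z_thresh
    z_steered = np.where(flagged, mu, z)
    h_steered = sae_decode(z_steered, W_dec, b_dec) + residual
    return h_steered, flagged

# Toy demo with random (untrained) weights, for illustration only.
rng = np.random.default_rng(0)
d, k = 8, 16  # hidden size and number of SAE features (hypothetical)
W_enc, b_enc = rng.normal(size=(d, k)), np.zeros(k)
W_dec, b_dec = rng.normal(size=(k, d)), np.zeros(d)

H_benign = rng.normal(size=(200, d))
mu, sigma = fit_benign_stats(H_benign, W_enc, b_enc)

# Simulate a perturbation that strongly drives feature 3.
h_perturbed = H_benign[0] + 10.0 * W_enc[:, 3]
h_steered, flagged = steer(h_perturbed, W_enc, b_enc, W_dec, b_dec, mu, sigma)
print("flagged feature indices:", np.flatnonzero(flagged))
```

In a real setting, `h` would be a hidden state taken from the reward model and the steered state would be passed on to the remaining layers; which layer to hook and how to calibrate the threshold are choices this sketch leaves open.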
