Reward Models Can Improve Themselves: Reward-Guided Adversarial Failure Mode Discovery for Robust Reward Modeling
Pankayaraj Pathmanathan, Furong Huang

TL;DR
This paper introduces REFORM, a self-improving framework that uses reward-guided adversarial generation to identify and fix reward model failures, enhancing robustness and alignment in language models.
Contribution
The paper presents a preference-distribution agnostic method for discovering reward model failures and a self-improving framework that uses adversarial examples to improve reward robustness.
Findings
REFORM significantly improves reward model robustness on benchmark datasets.
It maintains reward quality and downstream policy training performance.
It reduces spurious correlations, enhancing alignment quality.
Abstract
Reward modeling (RM), which captures human preferences to align large language models (LLMs), is increasingly employed in tasks such as model finetuning, response filtering, and ranking. However, due to the inherent complexity of human preferences and the limited coverage of available datasets, reward models often fail under distributional shifts or adversarial perturbations. Existing approaches for identifying such failure modes typically rely on prior knowledge about preference distributions or failure attributes, limiting their practicality in real-world settings where such information is unavailable. In this work, we propose a tractable, preference-distribution-agnostic method for discovering reward model failure modes via reward-guided controlled decoding. Building on this, we introduce REFORM, a self-improving reward modeling framework that enhances robustness by using the reward…
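To make the idea of reward-guided controlled decoding concrete, here is a minimal toy sketch. It is not the paper's algorithm: the abstract above does not give the decoding details, so the greedy token-level search, the function names (`reward_guided_decode`, `toy_propose`, `toy_reward`), and the hand-written token weights are all assumptions made for illustration. In a real setting, `propose` would presumably come from a language model's next-token candidates and `reward` from the trained reward model.

```python
from typing import Callable, List, Sequence


def reward_guided_decode(
    prompt: List[str],
    propose: Callable[[List[str]], Sequence[str]],
    reward: Callable[[List[str]], float],
    steps: int,
    minimize: bool = True,
) -> List[str]:
    """Greedy reward-guided decoding (illustrative assumption, not the paper's method).

    At each step, every candidate token is appended to the current sequence,
    scored by `reward`, and the lowest-scoring (or highest-scoring) one is kept.
    """
    seq = list(prompt)
    pick = min if minimize else max
    for _ in range(steps):
        candidates = propose(seq)
        if not candidates:
            break
        best = pick(candidates, key=lambda tok: reward(seq + [tok]))
        seq.append(best)
    return seq


def toy_propose(seq: List[str]) -> Sequence[str]:
    # Hypothetical stand-in for a language model's top-k next tokens.
    return ["helpful", "polite", "sorry", "refuse"]


def toy_reward(seq: List[str]) -> float:
    # Hypothetical stand-in for a reward model that over-rewards politeness.
    weights = {"helpful": 1.0, "polite": 2.0, "sorry": -1.0, "refuse": -0.5}
    return sum(weights.get(tok, 0.0) for tok in seq)


# Steer generation toward sequences the reward model scores low or high.
low = reward_guided_decode(["answer:"], toy_propose, toy_reward, steps=3, minimize=True)
high = reward_guided_decode(["answer:"], toy_propose, toy_reward, steps=3, minimize=False)
```

In this toy setup, the high-reward search fills the response with the over-rewarded token, which is the kind of spurious pattern a failure-mode search is meant to surface. Whether a surfaced example is a genuine reward model failure still depends on its true preference label, which this sketch does not model.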
