Reward Auditor: Inference on Reward Modeling Suitability in Real-World Perturbed Scenarios
Jianxiang Zang, Yongda Wei, Ruxue Bai, Shiyu Jiang, Nijia Mo, Binhong Li, Qiang Sun, Hui Liu

TL;DR
The paper introduces Reward Auditor, a framework for assessing the suitability and robustness of reward models in real-world perturbed scenarios, addressing vulnerabilities overlooked by existing evaluation methods.
Contribution
It proposes a hypothesis-testing framework that infers systematic vulnerabilities of reward models under real-world perturbations, emphasizing statistical significance and effect size.
Findings
Quantifies RM vulnerabilities in real-world scenarios.
Audits distribution degradation of RM preference confidence.
Provides a foundation for safer, more robust LLM alignment.
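Illustration
The summary above describes auditing an RM by testing whether its preference confidence degrades under perturbation, and reporting both statistical significance and effect size. The sketch below is not the paper's procedure, which this page does not spell out. It is a generic stand-in that assumes paired confidence scores for the same samples before and after perturbation, and it uses a one-sided sign-flip permutation test with a paired Cohen's d. The function name `paired_audit`, its parameters, and the scores are hypothetical.

```python
import math
import random
import statistics


def paired_audit(clean, perturbed, n_perm=10000, seed=0):
    """Test whether confidence drops under perturbation (hypothetical helper).

    clean, perturbed: paired RM preference-confidence scores for the same samples.
    Returns (mean_drop, effect_size, p_value).
    """
    if len(clean) != len(perturbed) or len(clean) < 2:
        raise ValueError("need two paired sequences with at least 2 samples")

    # Positive difference means the RM lost confidence after perturbation.
    diffs = [c - p for c, p in zip(clean, perturbed)]
    mean_drop = statistics.fmean(diffs)

    # Paired Cohen's d: mean difference over the std. dev. of the differences.
    sd = statistics.stdev(diffs)
    effect_size = mean_drop / sd if sd > 0 else math.nan

    # Sign-flip permutation test: under the null of no systematic drop,
    # each difference is equally likely to be positive or negative.
    rng = random.Random(seed)
    extreme = 0
    for _ in range(n_perm):
        perm_mean = statistics.fmean(
            d if rng.random() < 0.5 else -d for d in diffs
        )
        if perm_mean >= mean_drop:
            extreme += 1

    # Add-one correction keeps the estimated p-value strictly positive.
    p_value = (extreme + 1) / (n_perm + 1)
    return mean_drop, effect_size, p_value


# Made-up scores for illustration only; not results from the paper.
clean_scores = [0.90, 0.85, 0.80, 0.95, 0.88, 0.92, 0.87, 0.90, 0.83, 0.91]
perturbed_scores = [0.80, 0.70, 0.78, 0.85, 0.80, 0.75, 0.82, 0.79, 0.70, 0.88]
drop, d, p = paired_audit(clean_scores, perturbed_scores, n_perm=2000)
```

Reporting the effect size next to the p-value matters because a large enough sample can make a negligible drop statistically significant. The two numbers together separate "detectable" from "large enough to care about".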
Abstract
Reliable reward models (RMs) are critical for ensuring the safe alignment of large language models (LLMs). However, current RM evaluation methods focus solely on preference perception accuracy in given, specific scenarios, obscuring the critical vulnerabilities of RMs in real-world scenarios. We identify that the true challenge lies in assessing a novel dimension: Suitability, defined as conditional reliability under specific real-world perturbations. To this end, we introduce Reward Auditor, a hypothesis-testing framework specifically designed for RM suitability inference. Rather than answering "How accurate is the RM's preference perception for given samples?", it employs scientific auditing to answer: "Can we infer that RMs exhibit systematic vulnerabilities in specific real-world scenarios?" Under real-world perturbed scenarios, Reward Auditor quantifies statistical significance and effect…
