RewardUQ: A Unified Framework for Uncertainty-Aware Reward Models
Daniel Yang, Samuel Stante, Florian Redhardt, Lena Libon, Parnian Kassraie, Ido Hakimi, Barna P\'asztor, Andreas Krause

TL;DR
RewardUQ introduces a comprehensive framework for evaluating uncertainty quantification in reward models, highlighting the impact of model size and initialization, and providing tools for improved alignment of language models with human preferences.
Contribution
It systematically compares uncertainty quantification methods for reward models and offers an open-source framework for evaluation and development.
Findings
Model size and initialization significantly affect performance.
Most prior methods could be improved with alternative design choices.
The framework facilitates better evaluation and deployment of reward models.
Abstract
Reward models are central to aligning large language models (LLMs) with human preferences. Yet most approaches rely on pointwise reward estimates that overlook the epistemic uncertainty in reward models arising from limited human feedback. Recent work suggests that quantifying this uncertainty can reduce the costs of human annotation via uncertainty-guided active learning and mitigate reward overoptimization in LLM post-training. However, uncertainty-aware reward models have so far been adopted without thorough comparison, leaving them poorly understood. This work introduces a unified framework, RewardUQ, to systematically evaluate uncertainty quantification for reward models. We compare common methods along standard metrics measuring accuracy and calibration, and we propose a new ranking strategy incorporating both dimensions for a simplified comparison. Our experimental results…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsTopic Modeling · Explainable Artificial Intelligence (XAI) · Recommender Systems and Techniques
