Long-form RewardBench: Evaluating Reward Models for Long-form Generation
Hui Huang, Yancheng He, Wei Liu, Muyun Yang, Jiaheng Liu, Kehai Chen, Bing Xu, Conghui Zhu, Hailong Cao, Tiejun Zhao

TL;DR
This paper introduces Long-form RewardBench, a comprehensive benchmark for evaluating reward models in long-form generation tasks, revealing current models' limitations and differences between classifier and generative approaches.
Contribution
It presents the first dedicated benchmark for long-form reward modeling, including a novel test and extensive evaluation of 20+ reward models across multiple subtasks.
Findings
Current reward models still lack long-form reward modeling capabilities.
Reward performance correlates with response length and error position.
Classifiers generalize better than generative models.
Abstract
The widespread adoption of reinforcement learning-based alignment highlights the growing importance of reward models. Many benchmarks have been built to evaluate reward models across domains and scenarios. However, a significant gap remains in assessing reward models for long-form generation, despite its critical role in real-world applications. To bridge this gap, we introduce Long-form RewardBench, the first reward modeling testbed specifically designed for long-form generation. Our benchmark encompasses five key subtasks: QA, RAG, Chat, Writing, and Reasoning. We collected instruction and preference data through a meticulously designed multi-stage data collection process, and conducted extensive experiments on 20+ mainstream reward models, including both classifiers and generative models. Our findings reveal that current models still lack long-form reward modeling capabilities.…
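The abstract excerpt does not say how reward models are scored. Preference benchmarks commonly report pairwise accuracy: the share of (chosen, rejected) pairs in which the model scores the preferred response higher. The following is a minimal sketch under that assumption, not the paper's protocol. The names `PreferencePair`, `pairwise_accuracy`, and `length_score`, and the toy data, are all hypothetical; only the subtask labels come from the abstract.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, NamedTuple


class PreferencePair(NamedTuple):
    subtask: str    # e.g. "QA", "RAG", "Chat", "Writing", "Reasoning"
    prompt: str
    chosen: str     # response labeled as preferred
    rejected: str   # response labeled as worse


def pairwise_accuracy(
    pairs: Iterable[PreferencePair],
    score: Callable[[str, str], float],
) -> Dict[str, float]:
    """Per-subtask fraction of pairs where the chosen response scores strictly higher."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for pair in pairs:
        total[pair.subtask] += 1
        if score(pair.prompt, pair.chosen) > score(pair.prompt, pair.rejected):
            correct[pair.subtask] += 1
    return {task: correct[task] / total[task] for task in total}


# Toy stand-in for a reward model: it simply prefers longer responses.
# Hypothetical and for illustration only; a real scorer would call a model.
def length_score(prompt: str, response: str) -> float:
    return float(len(response))


toy_pairs = [
    PreferencePair("QA", "q1", "a detailed answer", "short"),
    PreferencePair("QA", "q2", "ok", "a long but wrong answer"),
    PreferencePair("Writing", "w1", "a full essay draft", "stub"),
]
print(pairwise_accuracy(toy_pairs, length_score))
```

A classifier-style reward model fits the `score` callable directly. A generative judge that compares two responses at once would need a thin adapter, which this sketch does not cover.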
Taxonomy
Topics: Recommender Systems and Techniques · Explainable Artificial Intelligence (XAI) · Emotion and Mood Recognition
