Long-form RewardBench: Evaluating Reward Models for Long-form Generation

Hui Huang; Yancheng He; Wei Liu; Muyun Yang; Jiaheng Liu; Kehai Chen; Bing Xu; Conghui Zhu; Hailong Cao; Tiejun Zhao

arXiv:2603.12963 · cs.CL · March 16, 2026

TL;DR

This paper introduces Long-form RewardBench, a comprehensive benchmark for evaluating reward models in long-form generation tasks, revealing current models' limitations and differences between classifier and generative approaches.

Contribution

The paper presents the first dedicated benchmark for long-form reward modeling, along with a novel test and an extensive evaluation of 20+ reward models across five subtasks: QA, RAG, Chat, Writing, and Reasoning.

Findings

01. Current reward models still lack long-form reward modeling capabilities.

02. Reward modeling performance correlates with response length and with the position of the error within a response.

03. Classifier reward models generalize better than generative reward models.

Abstract

The widespread adoption of reinforcement learning-based alignment highlights the growing importance of reward models. Various benchmarks have been built to evaluate reward models in various domains and scenarios. However, a significant gap remains in assessing reward models for long-form generation, despite its critical role in real-world applications. To bridge this, we introduce Long-form RewardBench, the first reward modeling testbed specifically designed for long-form generation. Our benchmark encompasses five key subtasks: QA, RAG, Chat, Writing, and Reasoning. We collected instruction and preference data through a meticulously designed multi-stage data collection process, and conducted extensive experiments on 20+ mainstream reward models, including both classifiers and generative models. Our findings reveal that current models still lack long-form reward modeling capabilities.…
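The abstract does not spell out the scoring protocol, so the following is only a minimal sketch of how a classifier-style reward model is commonly scored on preference data: the model counts as correct on a pair when it assigns the preferred response a higher reward than the rejected one, and accuracy is reported per subtask. The tuple layout, the `pairwise_accuracy` helper, and the toy `length_scorer` are hypothetical names introduced for illustration and are not taken from the paper.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

# Hypothetical record layout: (subtask, prompt, chosen_response, rejected_response).
PreferencePair = Tuple[str, str, str, str]


def pairwise_accuracy(
    pairs: List[PreferencePair],
    score: Callable[[str, str], float],
) -> Dict[str, float]:
    """Return, per subtask, the fraction of pairs where the chosen response outscores the rejected one."""
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for subtask, prompt, chosen, rejected in pairs:
        total[subtask] += 1
        # Ties count as errors: the model failed to prefer the chosen response.
        if score(prompt, chosen) > score(prompt, rejected):
            correct[subtask] += 1
    return {subtask: correct[subtask] / total[subtask] for subtask in total}


def length_scorer(prompt: str, response: str) -> float:
    """Toy stand-in for a reward model (assumption): longer responses get higher reward."""
    return float(len(response))


# Toy data, invented for this example only.
toy_pairs: List[PreferencePair] = [
    ("QA", "q1", "a detailed and correct answer", "short"),
    ("QA", "q2", "ok", "a long but wrong answer"),
    ("Writing", "w1", "a full essay draft", "stub"),
]

accuracy = pairwise_accuracy(toy_pairs, length_scorer)
print(accuracy)
```

In practice `score` would wrap a real reward model; a generative judge that outputs a verdict rather than a scalar would need a different comparison step. The length-based scorer here also illustrates why response length is a confound worth measuring: it gets the second QA pair wrong purely because the rejected response is longer.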

Taxonomy

Topics: Recommender Systems and Techniques · Explainable Artificial Intelligence (XAI) · Emotion and Mood Recognition