Semantically-Aware Rewards for Open-Ended R1 Training in Free-Form Generation

Zongxia Li; Yapei Chang; Yuhang Zhou; Xiyang Wu; Zichao Liang; Yoo Yeon Sung; Jordan Lee Boyd-Graber

arXiv:2506.15068 · cs.CL · June 19, 2025

TL;DR

This paper introduces PrefBERT, a semantic evaluation model for open-ended long-form generation that improves reward signals in training language models, leading to outputs more aligned with human preferences.

Contribution

We propose PrefBERT, a novel scoring model trained on diverse datasets to provide better semantic rewards for open-ended generation, surpassing traditional metrics in guiding model training.

Findings

01 PrefBERT outperforms ROUGE-L and BERTScore in semantic evaluation.

02 Training with PrefBERT yields responses more aligned with human preferences.

03 PrefBERT remains reliable across varied long-form responses.

Abstract

Evaluating open-ended long-form generation is challenging because it is hard to define what clearly separates good from bad outputs. Existing methods often miss key aspects like coherence, style, or relevance, or are biased by pretraining data, making open-ended long-form evaluation an underexplored problem. To address this gap, we propose PrefBERT, a scoring model for evaluating open-ended long-form generation in GRPO and guiding its training with distinct rewards for good and bad outputs. Trained on two response evaluation datasets with diverse long-form styles and Likert-rated quality, PrefBERT effectively supports GRPO by offering better semantic reward feedback than traditional metrics ROUGE-L and BERTScore do. Through comprehensive evaluations, including LLM-as-a-judge, human ratings, and qualitative analysis, we show that PrefBERT, trained on multi-sentence and paragraph-length…
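The role of a scalar reward in GRPO can be illustrated with a minimal sketch. It assumes the commonly described GRPO step in which the rewards for a group of responses sampled from the same prompt are normalized against that group's mean and standard deviation. The function name, the example scores, the use of population standard deviation, and the `eps` constant are illustrative assumptions, not details taken from the paper; the scores stand in for what a scoring model such as PrefBERT might return.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize one group's rewards to zero mean and roughly unit scale.

    `rewards` holds one scalar score per sampled response to a single prompt.
    `eps` (an assumed constant) avoids division by zero when all scores tie.
    """
    if not rewards:
        raise ValueError("rewards must be non-empty")
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical scores for four sampled responses to the same prompt.
scores = [0.82, 0.35, 0.60, 0.15]
advantages = group_relative_advantages(scores)

# Responses scored above the group mean receive positive advantages,
# those below it receive negative ones.
for score, adv in zip(scores, advantages):
    print(f"score={score:.2f}  advantage={adv:+.3f}")
```

Under this scheme, a reward that barely separates good from bad responses gives near-identical scores within a group and therefore little usable ranking signal, which is one reason the choice of reward metric matters.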

Code & Models

Models

IntelligenceLab/RewardPreferenceBert (Hugging Face model · 57 downloads · 3 likes)

Taxonomy

Topics: Simulation Techniques and Applications · Scientific Computing and Data Management