Pairwise Preference Reward and Group-Based Diversity Enhancement for Superior Open-Ended Generation
Guining Cao, Jiaxin Peng, Chu Zeng, Yu Zhao, Shuangyong Song, Yongxiang

TL;DR
The paper introduces PPR-GDE, a reinforcement learning method for open-ended generation that enhances diversity and alignment without relying on scalar rewards, using pairwise preferences and group-based diversity rewards.
Contribution
It proposes a novel RL approach that incorporates pairwise preference rewards and group diversity rewards, specifically designed for open-ended generation tasks.
Findings
PPR-GDE achieves better alignment quality than strong RL baselines.
It enhances expressive diversity and semantic coverage.
Pairwise preference is crucial for subjective alignment.
Abstract
Current reinforcement learning (RL) methods are broadly applicable and powerful in verifiable settings where scalar rewards can be provided. However, in open-ended generation tasks, verifying the correctness of responses remains challenging, and training reward models incurs substantial computational and annotation costs. Moreover, reinforcement learning with verifiable rewards (RLVR) often leads to diversity collapse and produces stereotypical or rigid outputs, outcomes that are particularly undesirable in open-domain scenarios. We propose Pairwise Preference Reward and Group-based Diversity Enhancement (PPR-GDE), an RL method that is more suitable for open-ended generation. PPR-GDE does not require scalar rewards and incorporates group-level diversity into the reward signal. It preserves the comparative structure of subjective evaluation through a pairwise preference reward, mitigates judge position bias via…
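To make the idea of a scalar-free, group-level reward concrete, the following is a minimal sketch, not the paper's implementation. It assumes a group of sampled responses for one prompt, a hypothetical `judge(a, b)` callable that returns True when `a` is preferred over `b`, a token-level Jaccard distance as a stand-in diversity measure, and a hypothetical `diversity_weight` mixing coefficient. None of these names or choices come from the paper, and the sketch does not attempt to reproduce the position-bias mitigation that the abstract excerpt truncates.

```python
from itertools import combinations
from typing import Callable, List, Sequence


def pairwise_win_rates(
    responses: Sequence[str],
    judge: Callable[[str, str], bool],
) -> List[float]:
    """Fraction of in-group comparisons each response wins.

    `judge` is a hypothetical pairwise comparator (an assumption of this
    sketch): it returns True if the first argument is preferred.
    """
    n = len(responses)
    if n < 2:
        return [0.0] * n
    wins = [0] * n
    for i, j in combinations(range(n), 2):
        if judge(responses[i], responses[j]):
            wins[i] += 1
        else:
            wins[j] += 1
    return [w / (n - 1) for w in wins]


def jaccard_distance(a: str, b: str) -> float:
    """Token-set Jaccard distance; a stand-in diversity measure (assumption)."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    union = tokens_a | tokens_b
    if not union:
        return 0.0
    return 1.0 - len(tokens_a & tokens_b) / len(union)


def diversity_scores(responses: Sequence[str]) -> List[float]:
    """Mean distance from each response to the rest of its group."""
    n = len(responses)
    if n < 2:
        return [0.0] * n
    return [
        sum(jaccard_distance(responses[i], responses[j])
            for j in range(n) if j != i) / (n - 1)
        for i in range(n)
    ]


def group_rewards(
    responses: Sequence[str],
    judge: Callable[[str, str], bool],
    diversity_weight: float = 0.5,  # hypothetical mixing coefficient
) -> List[float]:
    """Combine pairwise preference and group diversity into one reward."""
    preference = pairwise_win_rates(responses, judge)
    diversity = diversity_scores(responses)
    return [p + diversity_weight * d for p, d in zip(preference, diversity)]
```

In practice the judge would presumably be a language model or a human annotator rather than a simple function, and the diversity term would likely use a semantic measure. The point of the sketch is only that every reward is derived from comparisons within the group, so no absolute scalar score per response is ever required.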
