Triviality Corrected Endogenous Reward
Xinda Wang, Zhengxu Hou, Yangshijie Zhang, Bingren Yan, Jialin Liu, Chenzhuo Zhao, Zhibo Yang, Bin-Bin Yang, Feng Xiao

TL;DR
This paper introduces TCER, a novel reward mechanism for reinforcement learning in open-ended text generation that mitigates triviality bias and improves output diversity and quality across tasks.
Contribution
We propose TCER, a new endogenous reward method that rewards information gain relative to a reference policy, addressing triviality bias in open-ended text generation.
Findings
TCER improves diversity and content quality in open-ended text generation.
TCER outperforms baseline methods across multiple benchmarks.
TCER effectively transfers to mathematical reasoning tasks.
Abstract
Reinforcement learning for open-ended text generation is constrained by the lack of verifiable rewards, necessitating reliance on judge models that require either annotated data or powerful closed-source models. Inspired by recent work on unsupervised reinforcement learning for mathematical reasoning using confidence-based endogenous rewards, we investigate whether this principle can be adapted to open-ended writing tasks. We find that directly applying confidence rewards leads to Triviality Bias: the policy collapses toward high-probability outputs, reducing diversity and meaningful content. We propose TCER (Triviality Corrected Endogenous Reward), which addresses this bias by rewarding the relative information gain between a specialist policy and a generalist reference policy, modulated by a probability-dependent correction mechanism. Across multiple writing benchmarks and model…
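The contrast the abstract draws between a plain confidence reward and a gain-based reward can be illustrated with a small sketch. It is not the paper's formula, which the text here does not give. The function names, the `gamma` parameter, and the weight `(1 - p_ref) ** gamma` used as the probability-dependent correction are assumptions made only for this illustration.

```python
import math
from typing import Sequence


def confidence_reward(policy_logprobs: Sequence[float]) -> float:
    """Plain confidence reward: mean token log-probability under the policy."""
    if not policy_logprobs:
        raise ValueError("empty sequence")
    return sum(policy_logprobs) / len(policy_logprobs)


def corrected_gain_reward(
    policy_logprobs: Sequence[float],
    ref_logprobs: Sequence[float],
    gamma: float = 1.0,  # hypothetical knob controlling the correction strength
) -> float:
    """Mean per-token log-ratio between policy and reference, down-weighted
    where the reference already assigns high probability.

    ASSUMPTION: the weight (1 - p_ref) ** gamma is a stand-in for the
    paper's probability-dependent correction, not its actual definition.
    """
    if not policy_logprobs or len(policy_logprobs) != len(ref_logprobs):
        raise ValueError("sequences must be non-empty and equal in length")
    total = 0.0
    for lp, lr in zip(policy_logprobs, ref_logprobs):
        gain = lp - lr                          # information gain over the reference
        weight = (1.0 - math.exp(lr)) ** gamma  # near 0 for tokens the reference finds trivial
        total += weight * gain
    return total / len(policy_logprobs)


# Toy token probabilities (made up for illustration, not data from the paper).
trivial_policy = [math.log(p) for p in (0.99, 0.99, 0.99)]
trivial_ref = [math.log(p) for p in (0.98, 0.98, 0.98)]
informative_policy = [math.log(p) for p in (0.6, 0.5, 0.7)]
informative_ref = [math.log(p) for p in (0.1, 0.05, 0.2)]

conf_trivial = confidence_reward(trivial_policy)
conf_informative = confidence_reward(informative_policy)
gain_trivial = corrected_gain_reward(trivial_policy, trivial_ref)
gain_informative = corrected_gain_reward(informative_policy, informative_ref)
```

In this toy setup the confidence reward ranks the near-certain sequence higher, which is the kind of collapse the abstract calls Triviality Bias, while the gain-based sketch ranks the sequence that departs from the reference higher.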
