Two is better than one: A Collapse-free Multi-Reward RLIF Training Framework
Shourov Joarder, Diganta Sikdar, Ahsan Habib Akash, Binod Bhattarai, Prashnna Gyawali

TL;DR
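This paper introduces a multi-reward reinforcement learning from internal feedback (RLIF) framework for large language models. It improves training stability and reasoning ability by combining an answer-level reward (cluster voting) with a completion-level reward (token-wise self-certainty), together with GDPO-based normalization and KL-Cov regularization.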
Contribution
The paper proposes a multi-reward RLIF method that pairs GDPO-based reward normalization with KL-Cov regularization, aiming to prevent reward hacking and entropy collapse and to improve unsupervised reasoning performance.
Findings
Improves stability and robustness over prior unsupervised RL methods.
Achieves performance close to supervised RLVR on reasoning and code-generation tasks.
Prevents entropy collapse and reward hacking through regularization.
Abstract
Reinforcement learning with verifiable rewards (RLVR) has substantially improved the reasoning ability of LLMs, but often depends on external supervision from human annotations or gold-standard solutions. Reinforcement learning from internal feedback (RLIF) has recently emerged as a scalable unsupervised alternative, using signals extracted from the model itself. However, existing RLIF methods typically rely on a single internal reward, which can lead to reward hacking, entropy collapse, and degraded reasoning structure. We propose a multi-reward RLIF framework that decomposes the training signal into two complementary components: an answer-level reward based on cluster voting and a completion-level reward based on token-wise self-certainty. To combine these signals robustly, we apply GDPO-based normalization to reduce reward-scale imbalance. We further introduce KL-Cov regularization,…
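The two reward components and their combination can be illustrated with a minimal sketch. It rests on several assumptions that the abstract does not confirm: exact string matching stands in for the clustering step of cluster voting, self-certainty is taken to be the mean per-token KL divergence from a uniform distribution to the model's next-token distribution, and "GDPO-based normalization" is read as standardizing each reward separately within a group of sampled completions before summing. The function names (`vote_rewards`, `self_certainty`, `zscore`, `combined_advantages`) and the `weights` parameter are hypothetical, and KL-Cov regularization is not shown.

```python
import math
from collections import Counter


def vote_rewards(answers):
    """Answer-level reward: 1.0 if a completion's final answer matches the
    most common answer in the group, else 0.0. Exact string matching is an
    assumed stand-in for the paper's clustering step."""
    majority, _ = Counter(answers).most_common(1)[0]
    return [1.0 if a == majority else 0.0 for a in answers]


def self_certainty(token_dists):
    """Completion-level reward (assumed definition): mean over tokens of
    KL(uniform || p), where p is the next-token distribution at that
    position. All probabilities must be strictly positive."""
    per_token = []
    for p in token_dists:
        v = len(p)
        # KL(U || p) = -(1/V) * sum_j log(V * p_j); zero when p is uniform.
        per_token.append(-sum(math.log(v * pj) for pj in p) / v)
    return sum(per_token) / len(per_token)


def zscore(values, eps=1e-6):
    """Standardize one reward within a group of sampled completions."""
    mean = sum(values) / len(values)
    var = sum((x - mean) ** 2 for x in values) / len(values)
    return [(x - mean) / (math.sqrt(var) + eps) for x in values]


def combined_advantages(answers, certainties, weights=(1.0, 1.0)):
    """Normalize each reward on its own scale, then take a weighted sum,
    so neither reward dominates purely because of its raw magnitude."""
    vote_z = zscore(vote_rewards(answers))
    cert_z = zscore(certainties)
    return [weights[0] * v + weights[1] * c for v, c in zip(vote_z, cert_z)]


if __name__ == "__main__":
    # Four sampled completions for one prompt (illustrative values only).
    answers = ["4", "4", "5", "4"]
    certainties = [0.9, 1.1, 3.0, 1.0]
    print(combined_advantages(answers, certainties))
```

In this toy group the third completion is the most self-certain but disagrees with the majority answer, so after per-reward standardization its high certainty is roughly offset by the voting penalty. Summing the raw rewards instead would let the larger-scale self-certainty term dominate, which is the kind of reward-scale imbalance the abstract says the normalization is meant to reduce.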
