Utilizing and Calibrating Hindsight Process Rewards via Reinforcement with Mutual Information Self-Evaluation
Jiashu Yao, Heyan Huang, Zeming Liu, Yuhang Guo

TL;DR
This paper introduces MISE, a reinforcement learning paradigm that uses generative self-evaluation as dense rewards, improving learning efficiency and performance of large language models in sparse reward settings.
Contribution
It provides the first formal theoretical foundation for generative self-rewarding and demonstrates how to calibrate these rewards to enhance LLM training.
Findings
MISE enables autonomous learning from dense internal rewards.
Theoretical analysis links self-evaluation rewards to mutual information and KL divergence.
Experiments show MISE improves LLM performance, matching GPT-4o on validation.
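To make the two quantities named in the findings concrete, here is a minimal sketch of mutual information and KL divergence for small discrete distributions. It is not the paper's objective or training code; the function names and the toy probability tables are illustrative assumptions, and only the standard textbook definitions are used.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in nats for two discrete distributions on the same support.

    Assumes q[i] > 0 wherever p[i] > 0 (otherwise the divergence is infinite).
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def mutual_information(joint):
    """I(X; Y) in nats from a joint probability table joint[x][y]."""
    px = [sum(row) for row in joint]        # marginal over X
    py = [sum(col) for col in zip(*joint)]  # marginal over Y
    return sum(
        pxy * math.log(pxy / (px[x] * py[y]))
        for x, row in enumerate(joint)
        for y, pxy in enumerate(row)
        if pxy > 0
    )


# Hypothetical toy tables: X and Y independent vs. perfectly correlated.
independent = [[0.25, 0.25], [0.25, 0.25]]
correlated = [[0.5, 0.0], [0.0, 0.5]]

mi_independent = mutual_information(independent)
mi_correlated = mutual_information(correlated)
kl_same = kl_divergence([0.5, 0.5], [0.5, 0.5])
kl_diff = kl_divergence([0.5, 0.5], [0.9, 0.1])
```

On these toy tables, the independent case yields zero mutual information and the perfectly correlated binary case yields log 2 nats; the KL divergence is zero only when the two distributions coincide.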
Abstract
To overcome the sparse reward challenge in reinforcement learning (RL) for agents based on large language models (LLMs), we propose Mutual Information Self-Evaluation (MISE), an RL paradigm that utilizes hindsight generative self-evaluation as dense reward signals while simultaneously calibrating them against environmental feedback. Empirically, MISE enables an agent to learn autonomously from dense internal rewards that supplement sparse extrinsic signals. Theoretically, our work provides the first formal foundation for the paradigm of generative self-rewarding. We prove that utilizing hindsight self-evaluation rewards is equivalent to minimizing an objective that combines mutual information with a KL divergence term between the policy and a proxy reward policy. This theoretical insight then informs and justifies our calibration step, which actively aligns these rewards with the…
