Expected Return Causes Outcome-Level Mode Collapse in Reinforcement Learning and How to Fix It with Inverse Probability Scaling
Abhijeet Sinha, Sundari Elango, Dianbo Liu

TL;DR
This paper shows that the expected-return objective itself causes outcome-level mode collapse in reinforcement learning, and introduces inverse probability scaling as a simple fix that promotes outcome diversity.
Contribution
The paper identifies the expected-return objective as the structural cause of outcome-level mode collapse in RL and proposes inverse probability scaling as a minimal, effective correction.
Findings
Inverse probability scaling prevents outcome collapse in RL.
IPS-GRPO outperforms baseline methods in diverse outcome generation.
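Theoretical analysis shows that the collapse is a structural consequence of the expected-return objective, independent of the exploration strategy, entropy regularization, or optimization algorithm.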
Abstract
Many reinforcement learning (RL) problems admit multiple terminal solutions of comparable quality, where the goal is not to identify a single optimum but to represent a diverse set of high-quality outcomes. Nevertheless, policies trained by standard expected return maximization routinely collapse onto a small subset of outcomes, a phenomenon commonly attributed to insufficient exploration or weak regularization. We show that this explanation is incomplete: outcome-level mode collapse is a structural consequence of the expected-return objective itself. Under idealized learning dynamics, the log-probability ratio between any two outcomes evolves linearly in their reward difference, implying exponential ratio divergence and inevitable collapse independent of the exploration strategy, entropy regularization, or optimization algorithm. We identify the source of this pathology as the…
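The collapse claim in the abstract can be illustrated with a small numerical sketch. It is not the authors' code. It assumes replicator-style dynamics, dp_i/dt = p_i (R_i - sum_j p_j R_j), as a stand-in for the "idealized learning dynamics", because under them d/dt log(p_i / p_j) = R_i - R_j, which is linear in the reward difference. The excerpt does not give the exact form of inverse probability scaling, so the sketch also assumes it means replacing each outcome's reward r_i with r_i / p_i. The function name `simulate`, the reward values, and the step sizes are hypothetical.

```python
import numpy as np


def simulate(rewards, steps=20000, dt=0.01, inverse_prob_scaling=False):
    """Euler-integrate dp_i/dt = p_i * (R_i - sum_j p_j R_j) over outcomes.

    Assumption: these replicator dynamics stand in for the idealized
    learning dynamics mentioned in the abstract.
    Assumption: inverse probability scaling uses R_i = r_i / p_i.
    """
    r = np.asarray(rewards, dtype=float)
    p = np.full(r.shape, 1.0 / r.size)  # start from a uniform outcome distribution
    for _ in range(steps):
        # Effective per-outcome reward: plain, or scaled by 1 / p_i.
        R = r / p if inverse_prob_scaling else r
        # Outcomes with above-average effective reward gain probability mass.
        p = p + dt * p * (R - np.dot(p, R))
        # Guard against numerical underflow, then renormalize.
        p = np.clip(p, 1e-12, None)
        p /= p.sum()
    return p


# Three outcomes of comparable (but not identical) quality.
rewards = [1.0, 0.9, 0.8]

collapsed = simulate(rewards)                            # plain expected return
balanced = simulate(rewards, inverse_prob_scaling=True)  # assumed IPS form

print("expected return:", np.round(collapsed, 4))
print("inverse probability scaling:", np.round(balanced, 4))
```

Under these assumptions, the plain run concentrates nearly all probability on the highest-reward outcome even though the rewards differ by only 0.1, while the scaled run settles at a distribution proportional to the rewards: with R_i = r_i / p_i the update reduces to dp_i/dt = r_i - p_i * sum_j r_j, whose fixed point is p_i = r_i / sum_j r_j.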
Taxonomy
Topics: Reinforcement Learning in Robotics · Stochastic Gradient Optimization Techniques · Advanced Multi-Objective Optimization Algorithms
