Beyond Mode Collapse: Distribution Matching for Diverse Reasoning
Xiaozhe Li, Yang Li, Xinyu Fang, Shengyuan Ding, Peiji Li, Yongkang Chen, Yichuan Ma, Tianyi Lyu, Linyang Li, Dahua Lin, Qipeng Guo, Qingwen Liu, and Kai Chen

TL;DR
This paper introduces DMPO, a distribution-matching approach to prevent mode collapse in on-policy reinforcement learning, leading to more diverse solutions and improved reasoning performance across multiple tasks.
Contribution
Proposes DMPO, a novel distribution-matching policy optimization method that maintains solution diversity by aligning the policy distribution with a reward-proportional target distribution.
Findings
DMPO outperforms GRPO on NP-Bench with 9-12% relative improvements.
DMPO achieves better generalization in mathematical reasoning and out-of-domain tasks.
Distribution matching effectively prevents mode collapse, enhancing exploration and solution diversity.
Abstract
On-policy reinforcement learning methods like GRPO suffer from mode collapse: they exhibit reduced solution diversity, concentrating probability mass on a single solution once discovered and ceasing exploration of alternative strategies. We show this stems from reverse KL minimization's mode-seeking behavior, which reinforces the first high-reward trajectory found rather than maintaining a distribution over multiple diverse solutions. We propose DMPO (Distribution-Matching Policy Optimization), which prevents mode collapse through principled approximation of forward KL minimization. DMPO constructs a group-level target distribution over sampled trajectories proportional to their rewards, then aligns the policy distribution to this target. This provides mode-covering behavior without requiring sampling from the intractable global target distribution, enabling sustained exploration…
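The group-level matching idea described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation. It assumes non-negative scalar rewards, renormalizes the policy's sequence probabilities over the sampled group, and uses a plain forward KL divergence as the alignment loss. The function names (`group_target`, `group_policy`, `forward_kl`) and the uniform fallback for all-zero rewards are hypothetical choices made for illustration.

```python
import numpy as np

def group_target(rewards, eps=1e-12):
    """Target distribution over a sampled group, proportional to rewards.

    Assumption: rewards are non-negative. If every reward is zero, fall back
    to a uniform target (an illustrative choice, not taken from the paper).
    """
    r = np.asarray(rewards, dtype=float)
    if np.any(r < 0):
        raise ValueError("this sketch assumes non-negative rewards")
    total = r.sum()
    if total <= eps:
        return np.full(r.shape, 1.0 / r.size)
    return r / total

def group_policy(seq_logprobs):
    """Policy distribution restricted to the sampled group.

    Assumption: sequence log-probabilities are renormalized over the group
    with a numerically stable softmax.
    """
    z = np.asarray(seq_logprobs, dtype=float)
    w = np.exp(z - z.max())
    return w / w.sum()

def forward_kl(q, p, eps=1e-12):
    """KL(q || p) with the convention 0 * log 0 = 0."""
    q = np.asarray(q, dtype=float)
    p = np.asarray(p, dtype=float)
    mask = q > 0
    return float(np.sum(q[mask] * (np.log(q[mask]) - np.log(p[mask] + eps))))

# Two of four sampled trajectories are correct (reward 1).
rewards = [1.0, 1.0, 0.0, 0.0]
q = group_target(rewards)

# A collapsed policy puts almost all mass on the first correct trajectory;
# a balanced policy spreads mass over both correct trajectories.
collapsed = group_policy([-1.0, -10.0, -5.0, -5.0])
balanced = group_policy([-2.0, -2.0, -8.0, -8.0])

kl_collapsed = forward_kl(q, collapsed)
kl_balanced = forward_kl(q, balanced)
print(f"forward KL, collapsed policy: {kl_collapsed:.4f}")
print(f"forward KL, balanced policy:  {kl_balanced:.4f}")
```

Under these assumptions, the forward KL term penalizes the collapsed policy for neglecting the second high-reward trajectory, which is the mode-covering pressure the abstract attributes to distribution matching. In actual training the loss would be differentiated with respect to the policy parameters; this sketch only evaluates it.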
