Beyond Mode Collapse: Distribution Matching for Diverse Reasoning

Xiaozhe Li; Yang Li; Xinyu Fang; Shengyuan Ding; Peiji Li; Yongkang Chen; Yichuan Ma; Tianyi Lyu; Linyang Li; Dahua Lin; Qipeng Guo; Qingwen Liu; and Kai Chen

arXiv:2605.19461 · cs.AI · May 20, 2026

TL;DR

This paper introduces DMPO, a distribution-matching approach to prevent mode collapse in on-policy reinforcement learning, leading to more diverse solutions and improved reasoning performance across multiple tasks.

Contribution

Proposes DMPO, a distribution-matching policy optimization method that maintains solution diversity by aligning the policy distribution with a reward-proportional target distribution.
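
The alignment step described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the target is assumed to be each trajectory's reward divided by the group's total reward, the policy is assumed to be renormalized over the sampled group via a softmax of sequence log-probabilities, and the all-zero-reward fallback to a uniform target is a choice made here for the sketch.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def reward_proportional_target(rewards):
    """Hypothetical helper: group-level target with q_i proportional to r_i.

    Assumes non-negative rewards. If every reward is zero, this sketch
    falls back to a uniform target (an assumption, not from the paper).
    """
    if any(r < 0 for r in rewards):
        raise ValueError("this sketch assumes non-negative rewards")
    total = sum(rewards)
    if total == 0:
        return [1.0 / len(rewards)] * len(rewards)
    return [r / total for r in rewards]


def group_forward_kl(rewards, seq_logprobs):
    """Forward KL(q || p) between the target q and the group-normalized policy p.

    seq_logprobs holds one sequence log-probability per sampled trajectory.
    Renormalizing the policy over the group is an assumption of this sketch.
    """
    q = reward_proportional_target(rewards)
    p = softmax(seq_logprobs)
    # Terms with q_i == 0 contribute nothing to the forward KL.
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)


# Two correct trajectories (reward 1) and two incorrect ones (reward 0).
rewards = [1.0, 1.0, 0.0, 0.0]
collapsed_logprobs = [-1.0, -9.0, -9.0, -9.0]  # mass on one correct solution
balanced_logprobs = [-2.0, -2.0, -9.0, -9.0]   # mass shared by both

loss_collapsed = group_forward_kl(rewards, collapsed_logprobs)
loss_balanced = group_forward_kl(rewards, balanced_logprobs)
print(loss_balanced < loss_collapsed)
```

Under these assumptions, a policy that places nearly all of its mass on one of two equally rewarded trajectories incurs a larger loss than one that shares mass between them, which is the diversity-preserving pressure the contribution describes.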

Findings

01. DMPO outperforms GRPO on NP-Bench with 9-12% relative improvements.

02. DMPO achieves better generalization in mathematical reasoning and out-of-domain tasks.

03. Distribution matching effectively prevents mode collapse, enhancing exploration and solution diversity.

Abstract

On-policy reinforcement learning methods like GRPO suffer from mode collapse: they exhibit reduced solution diversity, concentrating probability mass on a single solution once it is discovered and ceasing to explore alternative strategies. We show that this stems from the mode-seeking behavior of reverse KL minimization, which reinforces the first high-reward trajectory found rather than maintaining a distribution over multiple diverse solutions. We propose DMPO (Distribution-Matching Policy Optimization), which prevents mode collapse through a principled approximation of forward KL minimization. DMPO constructs a group-level target distribution over sampled trajectories proportional to their rewards, then aligns the policy distribution to this target. This provides mode-covering behavior without requiring sampling from the intractable global target distribution, enabling sustained exploration…
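
The mode-seeking versus mode-covering distinction can be made concrete with a toy calculation. The sketch below is illustrative only and is not taken from the paper: the three-outcome distributions are invented for the example, and it uses the standard convention that reverse KL is KL(policy || target) and forward KL is KL(target || policy).

```python
import math


def kl(p, q):
    """KL(p || q) for discrete distributions given as lists of probabilities."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Toy target: two equally good solutions (A, B) and one poor outcome (C).
target = [0.495, 0.495, 0.01]

# Two candidate policies, neither of which matches the target exactly.
collapsed = [0.98, 0.01, 0.01]  # nearly all mass on solution A
spread = [0.30, 0.30, 0.40]     # covers A and B, but leaks mass onto C

forward = {
    "collapsed": kl(target, collapsed),
    "spread": kl(target, spread),
}
reverse = {
    "collapsed": kl(collapsed, target),
    "spread": kl(spread, target),
}

# Forward KL ranks the covering policy as the better fit,
# while reverse KL ranks the collapsed policy as the better fit.
print(min(forward, key=forward.get))
print(min(reverse, key=reverse.get))
```

In this toy setting forward KL heavily penalizes the collapsed policy for assigning almost no mass to solution B, whereas reverse KL penalizes the spread policy for placing mass where the target has little. That asymmetry is the intuition behind preferring a forward KL objective when diversity matters.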

Code & Models

Repositories

oliverleexz/DMPO
github

Datasets

OliverLee/NP_MM
dataset · 32 downloads