Distribution-Centric Policy Optimization Dominates Exploration-Exploitation Trade-off
Zhaochun Li, Chen Wang, Jionghao Bai, Shisheng Cui, Ge Lan, Zhou Zhao, Yue Wang

TL;DR
This paper introduces Distribution-Centric Policy Optimization (DCPO), a novel reinforcement learning approach that controls exploration by guiding policies with target distributions, improving stability and efficiency over existing sample-centric methods.
Contribution
The paper presents the first distribution-centric perspective in RL, reformulating entropy regulation as a distribution-level regularization for better exploration control.
Findings
DCPO outperforms GRPO by about 20% on average across benchmarks.
DCPO enables controllable, on-policy entropy regulation without external samples.
Distribution-level regularization enhances exploration stability and efficiency.
Abstract
The exploration-exploitation (EE) trade-off is a central challenge in reinforcement learning (RL) for large language models (LLMs). With Group Relative Policy Optimization (GRPO), training tends to be exploitation-driven: entropy decreases monotonically, samples converge, and exploration fades. Most existing fixes are sample-centric: they seek out or reward rare samples, assuming that exploration comes from novel trajectories and tokens. These heuristics depend on the "luck" of drawing informative samples, lack principled control of the policy, and often yield limited or inconsistent gains. In this work, we are the first to introduce a distribution-centric perspective for RL, in which exploration is always guided by a "better" target distribution, and reveal that a policy's ability to resist entropy collapse is governed by the distribution itself rather than individual samples.…
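To make the terms in the abstract concrete, here is a minimal sketch of what "entropy" and a distribution-level regularizer could look like for a single next-token distribution. It is not DCPO's objective, which the truncated abstract does not state. The temperature-flattened target, the example logits, and all function names are assumptions chosen purely for illustration.

```python
import math

def softmax(logits, temperature=1.0):
    """Turn logits into probabilities; temperature > 1 flattens the result."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats; low values indicate a collapsed policy."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def kl_divergence(p, q):
    """KL(p || q) for two distributions over the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

# Hypothetical next-token logits for a sharply peaked (exploitative) policy.
logits = [4.0, 1.0, 0.5, 0.0]
policy = softmax(logits)

# Assumed target: the same logits at a higher temperature, so it has
# higher entropy. This is an illustrative choice, not the paper's target.
target = softmax(logits, temperature=2.0)

policy_entropy = entropy(policy)
target_entropy = entropy(target)

# A distribution-level penalty: how far the policy is from the target.
regularizer = kl_divergence(target, policy)
```

In this sketch the penalty depends only on the two distributions, not on which tokens happened to be sampled, which is the contrast with sample-centric bonuses that the abstract draws.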
Taxonomy
Topics: Reinforcement Learning in Robotics · Topic Modeling · Multimodal Machine Learning Applications
