Distribution-Centric Policy Optimization Dominates Exploration-Exploitation Trade-off

Zhaochun Li; Chen Wang; Jionghao Bai; Shisheng Cui; Ge Lan; Zhou Zhao; Yue Wang

arXiv:2601.12730·cs.LG·January 21, 2026

TL;DR

This paper introduces Distribution-Centric Policy Optimization (DCPO), a novel reinforcement learning approach that controls exploration by guiding policies with target distributions, improving stability and efficiency over existing sample-centric methods.

Contribution

The paper presents the first distribution-centric perspective in RL, reformulating entropy regulation as a distribution-level regularization for better exploration control.

Findings

01. DCPO outperforms GRPO by about 20% on average across benchmarks.

02. DCPO enables controllable, on-policy entropy regulation without external samples.

03. Distribution-level regularization enhances exploration stability and efficiency.

Abstract

The exploration-exploitation (EE) trade-off is a central challenge in reinforcement learning (RL) for large language models (LLMs). With Group Relative Policy Optimization (GRPO), training tends to be exploitation-driven: entropy decreases monotonically, samples converge, and exploration fades. Most existing fixes are sample-centric: they seek out or reward rare samples, assuming that exploration comes from novel trajectories and tokens. These heuristics depend on the "luck" of drawing informative samples, lack principled control of the policy, and often yield limited or inconsistent gains. In this work, we are the first to introduce a distribution-centric perspective for RL, in which exploration is always guided by a "better" target distribution, and reveal that a policy's ability to resist entropy collapse is governed by the distribution itself rather than individual samples.…
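To make two quantities in the abstract concrete, here is a minimal Python sketch of (a) the Shannon entropy of a token distribution, which is what "entropy collapse" refers to, and (b) the group-relative advantage commonly associated with GRPO, in which each reward is standardized against the other responses sampled for the same prompt. This is not the DCPO method, which the truncated abstract does not specify. The function names, the use of the population standard deviation, and the `eps` constant are assumptions made for illustration.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a categorical distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of sampled responses.

    Assumption: population standard deviation plus a small eps for
    numerical safety; real implementations may differ in these details.
    """
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]


# A uniform next-token distribution has maximal entropy; a peaked one
# (the kind a collapsing policy produces) has much lower entropy.
uniform = [0.25, 0.25, 0.25, 0.25]
peaked = [0.97, 0.01, 0.01, 0.01]
print(entropy(uniform), entropy(peaked))

# Hypothetical binary rewards for four responses to one prompt.
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

When every response in a group earns the same reward, the advantages are all zero and that group contributes no learning signal. This is one reason converged samples and fading exploration go together.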

Code & Models

Models

wc597358816/DCPO_Qwen2.5-math-7B (Hugging Face model)

Taxonomy

Topics: Reinforcement Learning in Robotics · Topic Modeling · Multimodal Machine Learning Applications