DiPO: Disentangled Perplexity Policy Optimization for Fine-grained Exploration-Exploitation Trade-Off

Xiaofan Li; Ming Yang; Zhiyuan Ma; Shichao Ma; Jintao Du; Yu Cheng; Weiqiang Wang; Zhizhong Zhang; Xin Tan; Yanyun Qu; Lizhuang Ma; Yuan Xie

arXiv:2604.13902·cs.LG·April 16, 2026

TL;DR

This paper introduces DiPO, a method that splits the sample space by perplexity into exploration (high-perplexity) and exploitation (low-perplexity) subspaces to manage the exploration-exploitation trade-off in reinforcement learning for large language models, improving their reasoning and function calling abilities.

Contribution

It proposes a novel perplexity space disentangling strategy and a bidirectional reward mechanism for fine-grained exploration-exploitation trade-off in LLM training.

Findings

01 Demonstrates improved performance on mathematical reasoning tasks.

02 Shows enhanced function calling accuracy.

03 Validates the effectiveness of perplexity-guided exploration.

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has catalyzed significant advances in the reasoning capabilities of Large Language Models (LLMs). However, effectively managing the exploration-exploitation trade-off remains a critical challenge. In this paper, we analyze the exploration-exploitation dilemma posed by extremely hard and extremely easy samples during training and propose a new fine-grained trade-off mechanism. Concretely, we introduce a perplexity space disentangling strategy that divides the sample space into distinct exploration (high perplexity) and exploitation (low perplexity) subspaces, thereby mining fine-grained samples that require an exploration-exploitation trade-off. Subsequently, we propose a bidirectional reward allocation mechanism with minimal impact on verification rewards to implement perplexity-guided exploration and exploitation, enabling more stable…
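To make the perplexity split described in the abstract concrete, here is a minimal sketch in Python. It is not the authors' implementation: the function names `perplexity` and `split_by_perplexity`, the use of per-token log-probabilities as input, and the batch-median threshold are all assumptions made for illustration. The abstract does not say how the boundary between the two subspaces is chosen.

```python
import math
from statistics import median


def perplexity(token_logprobs):
    """Perplexity of one response: exp of the mean negative token log-probability."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def split_by_perplexity(samples, threshold=None):
    """Partition samples into high- and low-perplexity groups.

    `samples` maps a sample id to its list of token log-probabilities.
    `threshold` defaults to the batch median perplexity; this default is an
    assumption of the sketch, not something stated in the paper's abstract.
    Returns (high_ids, low_ids, ppl_by_id).
    """
    ppl = {sid: perplexity(lp) for sid, lp in samples.items()}
    if threshold is None:
        threshold = median(ppl.values())
    # High perplexity: candidate "exploration" subspace.
    high = [sid for sid, p in ppl.items() if p > threshold]
    # Low perplexity: candidate "exploitation" subspace.
    low = [sid for sid, p in ppl.items() if p <= threshold]
    return high, low, ppl


# Hypothetical batch of three responses with made-up log-probabilities.
batch = {
    "a": [-0.1, -0.2],
    "b": [-2.0, -3.0],
    "c": [-1.0, -1.0],
}
high_ids, low_ids, ppl_by_id = split_by_perplexity(batch)
```

In this toy batch, response `b` has the lowest token probabilities and so falls in the high-perplexity group, while `a` and `c` fall in the low-perplexity group. How the paper then assigns rewards to each group is not recoverable from the truncated abstract, so that step is left out.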
