Policy Split: Incentivizing Dual-Mode Exploration in LLM Reinforcement with Dual-Mode Entropy Regularization

Jiashu Yao; Heyan Huang; Chuwei Luo; Daiqing Wu; Zeming Liu; Yuhang Guo; Yangyang Kang

arXiv:2604.11510·cs.CL·April 14, 2026

TL;DR

Policy Split divides a single LLM policy into a normal mode and a high-entropy mode, selected with a high-entropy prompt, and trains both with collaborative dual-mode entropy regularization so that exploration improves without sacrificing accuracy.

Contribution

The paper presents a dual-mode policy paradigm in which both modes share model parameters and each mode receives entropy regularization tailored to its own objective, improving exploration in RL for LLMs.

Findings

01

Outperforms existing entropy-guided RL baselines across various model sizes.

02

Enables dual-mode exploration with distinct behavioral patterns.

03

Enhances both general and creative task performance.

Abstract

To encourage diverse exploration in reinforcement learning (RL) for large language models (LLMs) without compromising accuracy, we propose Policy Split, a novel paradigm that bifurcates the policy into normal and high-entropy modes with a high-entropy prompt. While sharing model parameters, the two modes undergo collaborative dual-mode entropy regularization tailored to distinct objectives. Specifically, the normal mode optimizes for task correctness, while the high-entropy mode incorporates a preference for exploration, and the two modes learn collaboratively. Extensive experiments demonstrate that our approach consistently outperforms established entropy-guided RL baselines across various model sizes in general and creative tasks. Further analysis reveals that Policy Split facilitates dual-mode exploration, where the high-entropy mode generates distinct behavioral patterns to the…
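
The dual-mode idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the regularizer's form, so the code below substitutes a standard entropy bonus with a different coefficient per mode. The prompt prefix `HIGH_ENTROPY_PROMPT`, the coefficients in `ENTROPY_COEF`, and all function names are hypothetical and chosen only for illustration.

```python
import math

# Hypothetical prefix that switches the shared policy into its high-entropy mode.
HIGH_ENTROPY_PROMPT = "[explore] "

# Assumed per-mode entropy-bonus coefficients (not taken from the paper).
ENTROPY_COEF = {"normal": 0.0, "high_entropy": 0.01}


def build_prompt(task_prompt, mode):
    """Return the prompt for the given mode; both modes use the same model."""
    if mode not in ENTROPY_COEF:
        raise ValueError(f"unknown mode: {mode}")
    if mode == "high_entropy":
        return HIGH_ENTROPY_PROMPT + task_prompt
    return task_prompt


def token_entropy(logits):
    """Shannon entropy (in nats) of softmax(logits) at one token position."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_entropy(logits_per_token):
    """Average token entropy over a generated sequence."""
    return sum(token_entropy(l) for l in logits_per_token) / len(logits_per_token)


def regularized_loss(policy_loss, logits_per_token, mode):
    """Policy loss minus a mode-specific entropy bonus (stand-in regularizer)."""
    return policy_loss - ENTROPY_COEF[mode] * mean_entropy(logits_per_token)
```

Under these assumptions, the normal mode is trained on the task loss alone, while the high-entropy mode is additionally rewarded for spreading probability mass across tokens. Both losses would be applied to the same parameters.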
