Entropy Polarity in Reinforcement Fine-Tuning: Direction, Asymmetry, and Control

Jiazheng Zhang; Ziche Fu; Junrui Shen; Yunbin Zhao; Yunke Zhang; Zhiheng Xi; Long Ma; Chenxin An; Zhihao Zhang; Shichun Liu; Dingwei Zhu; Shihan Dou; Shaofan Liu; Han Li; Wiggin Zhou; Aiden Adams; Tao Gui; Fei Huang; Qi Zhang; Xuanjing Huang

arXiv:2605.11775·cs.LG·May 15, 2026

TL;DR

This paper introduces a theoretical framework for entropy mechanics in reinforcement learning with verifiable rewards (RLVR), identifies a signed token-level quantity called entropy polarity, and proposes PAPO, an entropy-aware policy optimization method intended to better balance exploration and exploitation.

Contribution

The paper develops the concept of entropy polarity at the token level, analyzes the structural asymmetry between entropy contraction and entropy expansion, and proposes PAPO, an entropy-aware policy optimization method.

Findings

01. Entropy polarity reliably predicts entropy changes during training.

02. Positive and negative polarity branches complement each other in exploration and exploitation.

03. PAPO outperforms baselines in mathematical reasoning and agentic benchmarks.

Abstract

Policy entropy has emerged as a fundamental measure for understanding and controlling exploration in reinforcement learning with verifiable rewards (RLVR) for LLMs. However, existing entropy-aware methods mainly regulate entropy through global objectives, while the token-level mechanism by which sampled policy updates reshape policy entropy remains underexplored. In this work, we develop a theoretical framework of entropy mechanics in RLVR. Our analysis yields a first-order approximation of the entropy change, giving rise to entropy polarity, a signed token-level quantity that predicts how much a sampled update expands or contracts entropy. This analysis further reveals a structural asymmetry: reinforcing frequent high-probability tokens triggers contraction tendencies, whereas expansive tendencies typically require lower-probability samples or stronger distributional correction.…
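
The asymmetry described in the abstract can be illustrated with a small numerical sketch. This is not the paper's definition of entropy polarity and not PAPO. It is only the standard first-order calculus for a single softmax distribution, under the assumption that a sampled update moves the logits by lr * advantage * (onehot(token) - p). All function names below are hypothetical and chosen for illustration.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of logits.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    # Shannon entropy in nats.
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def entropy_grad_wrt_logits(probs):
    # For a softmax distribution, dH/dz_k = -p_k * (log p_k + H).
    h = entropy(probs)
    return [-p * (math.log(p) + h) for p in probs]

def sampled_update(probs, token, advantage, lr):
    # Assumed update: one policy-gradient step applied directly to the logits,
    # using grad of log p[token] w.r.t. logits = onehot(token) - p.
    return [lr * advantage * ((1.0 if i == token else 0.0) - p)
            for i, p in enumerate(probs)]

def entropy_change(logits, token, advantage=1.0, lr=0.01):
    # Returns (first-order predicted change, exact change) in entropy.
    probs = softmax(logits)
    delta = sampled_update(probs, token, advantage, lr)
    grad = entropy_grad_wrt_logits(probs)
    predicted = sum(g * d for g, d in zip(grad, delta))
    new_probs = softmax([z + d for z, d in zip(logits, delta)])
    exact = entropy(new_probs) - entropy(probs)
    return predicted, exact

if __name__ == "__main__":
    logits = [2.0, 1.0, 0.0, -1.0]  # toy distribution, token 0 is most likely
    probs = softmax(logits)
    for token in range(len(logits)):
        predicted, exact = entropy_change(logits, token)
        print(f"token={token} p={probs[token]:.3f} "
              f"predicted={predicted:+.6f} exact={exact:+.6f}")
```

In this toy setting, reinforcing the most likely token with a positive advantage gives a negative predicted entropy change, while reinforcing the least likely token gives a positive one, and the first-order prediction stays close to the exact change for a small step size. Whether the same pattern holds under the paper's sequence-level analysis is not something this sketch can establish.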
