HTPO: Towards Exploration-Exploitation Balanced Policy Optimization via Hierarchical Token-level Objective Control

Xincheng Yao; Ruoqi Li; Cheng Chen; Daoxin Zhang; Yi Wu; Yao Hu; Chongyang Zhang

arXiv:2605.08283·cs.LG·May 12, 2026

TL;DR

HTPO introduces a hierarchical token-level control mechanism in reinforcement learning to balance exploration and exploitation, significantly improving reasoning performance in large language models.

Contribution

The paper proposes HTPO, a novel RL algorithm that hierarchically partitions response tokens to dynamically balance exploration and exploitation during training.

Findings

01. HTPO outperforms the DAPO baseline on reasoning benchmarks (+8.6% and +6.7%).

02. HTPO maintains its performance advantage as test-time compute increases.

03. Adaptive token-level control enhances exploration without harming exploitation.

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a pivotal technique for enhancing the reasoning capabilities of Large Language Models (LLMs). However, mainstream RL algorithms in practice treat all tokens of a response equally and assign the same optimization objective to each token, failing to provide granular guidance for the reasoning process, even though different tokens usually play distinct roles in Chain-of-Thought (CoT) reasoning. As a result, current RL algorithms lack an effective mechanism to dynamically balance the exploration-exploitation trade-off during learning. To this end, we propose Hierarchical Token-level Objective Control Policy Optimization (HTPO), a novel RL algorithm that takes a divide-and-conquer approach, hierarchically partitioning the response tokens into specific functional groups from three aspects (i.e., prompt…
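To make the contrast in the abstract concrete, here is a minimal sketch of the difference between giving every token the same objective and giving token groups different ones. It is not HTPO's actual method: the abstract is cut off before it names the three partitioning aspects, so the group labels (`"explore"`, `"exploit"`), the function names, and the choice of per-group clip ranges as the controlled quantity are all hypothetical stand-ins. Only the PPO-style clipped surrogate itself is standard.

```python
from statistics import fmean


def clipped_surrogate(ratio, advantage, eps_low, eps_high):
    """Standard PPO-style clipped surrogate for a single token."""
    clipped = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    return min(ratio * advantage, clipped * advantage)


def uniform_objective(ratios, advantage, eps_low=0.2, eps_high=0.2):
    """Every token in the response gets the same clip range (the common practice)."""
    return fmean([clipped_surrogate(r, advantage, eps_low, eps_high) for r in ratios])


def grouped_objective(ratios, advantage, groups, group_eps):
    """Each token uses the clip range of its group.

    ASSUMPTION: `groups` and `group_eps` are illustrative only; how HTPO
    assigns tokens to groups and what it changes per group is not stated
    in the truncated abstract.
    """
    terms = []
    for ratio, group in zip(ratios, groups):
        eps_low, eps_high = group_eps[group]
        terms.append(clipped_surrogate(ratio, advantage, eps_low, eps_high))
    return fmean(terms)


if __name__ == "__main__":
    ratios = [1.5, 1.5]                       # importance ratios of two tokens
    groups = ["explore", "exploit"]           # hypothetical group labels
    group_eps = {"explore": (0.2, 0.6),       # looser upper clip: more room to move
                 "exploit": (0.2, 0.2)}       # tighter clip: conservative update
    print(uniform_objective(ratios, 1.0))
    print(grouped_objective(ratios, 1.0, groups, group_eps))
```

With identical ratios and a positive advantage, the token in the looser group keeps more of its update than the token in the tighter group, which is the kind of per-token differentiation a uniform objective cannot express.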

Code & Models

Repositories

xcyao00/HTPO (GitHub)
