TL;DR
HTPO introduces a hierarchical token-level control mechanism in reinforcement learning to balance exploration and exploitation, significantly improving reasoning performance in large language models.
Contribution
The paper proposes HTPO, a novel RL algorithm that hierarchically partitions response tokens to dynamically balance exploration and exploitation during training.
Findings
HTPO outperforms the DAPO baseline on reasoning benchmarks (+8.6% and +6.7%).
HTPO maintains its performance advantage as test-time compute increases.
Adaptive token-level control enhances exploration without harming exploitation.
Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a pivotal technique for enhancing the reasoning capabilities of Large Language Models (LLMs). However, the de facto practice of mainstream RL algorithms is to treat all tokens of a response equally and assign the same optimization objective to each token, which fails to provide granular guidance for the reasoning process. Yet in Chain-of-Thought (CoT) reasoning, different tokens usually play distinct roles. As a result, current RL algorithms lack an effective mechanism to dynamically balance the exploration-exploitation trade-off during learning. To this end, we propose Hierarchical Token-level Objective Control Policy Optimization (HTPO), a novel RL algorithm that takes a divide-and-conquer approach, hierarchically partitioning the response tokens into specific functional groups from three aspects (i.e., prompt…
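To make the contrast in the abstract concrete, the following is a minimal sketch, not the paper's implementation. The first function shows the common group-normalized advantage used by GRPO-style methods, where every token of a response receives the same value. The second shows one hypothetical way a token-level scheme could differ, by scaling each token's advantage according to a functional group label. The group names ("explore", "exploit"), the scale factors, and all function names are assumptions made for illustration; the abstract above is truncated before it specifies HTPO's actual partitioning or objectives.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Group-normalized advantage per response: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def uniform_token_advantages(rewards, lengths):
    """Common practice: broadcast one advantage to every token of a response."""
    advs = group_advantages(rewards)
    return [[a] * n for a, n in zip(advs, lengths)]


def grouped_token_advantages(rewards, token_groups, group_scale):
    """Hypothetical token-level control (an assumption, not HTPO itself):
    scale each token's advantage by a factor chosen per functional group."""
    advs = group_advantages(rewards)
    return [
        [a * group_scale[g] for g in groups]
        for a, groups in zip(advs, token_groups)
    ]


if __name__ == "__main__":
    rewards = [1.0, 0.0]  # verifiable rewards for two sampled responses
    print(uniform_token_advantages(rewards, lengths=[3, 2]))

    # Hypothetical group labels per token; the real grouping is not given here.
    token_groups = [["explore", "exploit", "exploit"], ["exploit", "explore"]]
    group_scale = {"explore": 1.5, "exploit": 1.0}
    print(grouped_token_advantages(rewards, token_groups, group_scale))
```

In the uniform version, the per-token lists are constant within each response, which is the "same optimization objective for each token" behavior the abstract criticizes. The grouped version only illustrates where a per-token signal could enter; how HTPO defines and weights its groups should be taken from the paper itself.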
