Token-Level Policy Optimization: Linking Group-Level Rewards to Token-Level Aggregation via Sequence-Level Likelihood

Xingyu Lin; Yilin Wen; Du Su; Jinchang Hou; En Wang; Wenbin Liu; Chenfu Bao; Zhonghou Lv

arXiv:2604.12736·cs.CL·April 15, 2026

TL;DR

TEPO is a token-level policy optimization framework that improves mathematical reasoning in LLMs by linking group-level rewards to individual tokens and stabilizing training, achieving state-of-the-art results.

Contribution

The paper proposes a token-level approach that uses sequence-level likelihood to link group-level rewards to individual tokens, and a token-level KL-Divergence mask that stabilizes training under sparse token rewards.

Findings

01

TEPO achieves state-of-the-art performance on mathematical reasoning benchmarks.

02

Training stability is significantly improved, reducing convergence time by 50%.

03

The method effectively mitigates entropy collapse in token-level sparse-reward scenarios.

Abstract

Group Relative Policy Optimization (GRPO) has significantly advanced the reasoning ability of large language models (LLMs), particularly in their mathematical reasoning performance. However, GRPO and related entropy regularization methods still struggle with token-level sparse-rewards, which is an inherent challenge in chain-of-thought (CoT) reasoning. These approaches often rely on undifferentiated token-level entropy regularization, which easily leads to entropy collapse or model degradation under sparse token rewards. In this work, we propose TEPO, a novel token-level framework that (1) leverages sequence-level likelihood to link group-level rewards with individual tokens via token-level aggregation, and (2) introduces a token-level KL-Divergence mask constraint that targets tokens with positive advantages and decreasing entropy to mitigate abrupt policy updates. Experiments…
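
To make the two ingredients in the abstract concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the abstract does not give TEPO's exact formulas, so the sketch only shows the standard GRPO group-relative advantage, the fact that a sequence log-likelihood is the sum of its token log-probabilities, and one plausible reading of the mask condition ("positive advantages and decreasing entropy"). All function names (`group_relative_advantages`, `sequence_log_likelihood`, `kl_mask`) and the toy numbers are hypothetical.

```python
import math


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: standardize each reward within its sampled group."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    return [(r - mean) / (math.sqrt(var) + eps) for r in rewards]


def sequence_log_likelihood(token_logps):
    """A sequence's log-likelihood is the sum of its per-token log-probabilities."""
    return sum(token_logps)


def kl_mask(advantage, old_entropy, new_entropy):
    """Assumed mask rule (not taken from the paper's equations):
    select tokens whose sequence has a positive advantage and whose
    entropy decreased between the old and the new policy."""
    return [
        1 if advantage > 0 and h_new < h_old else 0
        for h_old, h_new in zip(old_entropy, new_entropy)
    ]


# Toy group of four sampled responses with binary (sparse) rewards.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

# Per-token entropies for one response under the old and new policy (made up).
old_entropy = [2.0, 1.5, 0.7]
new_entropy = [1.8, 1.6, 0.5]

mask_rewarded = kl_mask(advantages[0], old_entropy, new_entropy)
mask_unrewarded = kl_mask(advantages[1], old_entropy, new_entropy)
```

In this sketch the mask only marks which tokens a KL penalty would apply to; how the penalty is weighted and how the sequence-level likelihood enters the loss are left open because the abstract does not specify them.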
