KTAE: A Model-Free Algorithm to Key-Tokens Advantage Estimation in Mathematical Reasoning
Wei Sun, Wen Yang, Pu Jian, Qianlong Du, Fuwei Cui, Shuo Ren, Jiajun Zhang

TL;DR
This paper introduces KTAE, a new model-free algorithm that provides fine-grained token-level advantage estimates in mathematical reasoning, improving reinforcement learning performance without additional models.
Contribution
KTAE offers a novel, model-free method for token-level advantage estimation, addressing granularity issues in existing reinforcement learning algorithms for language models.
Findings
Models with KTAE outperform baselines on five reasoning benchmarks.
KTAE achieves higher accuracy with shorter responses.
KTAE-trained models surpass R1-Distill-Qwen-1.5B when using the same base model.
Abstract
Recent advances have demonstrated that integrating reinforcement learning with rule-based rewards can significantly enhance the reasoning capabilities of large language models, even without supervised fine-tuning. However, prevalent reinforcement learning algorithms, such as GRPO and variants like DAPO, suffer from coarse granularity when computing the advantage. Specifically, they compute rollout-level advantages that assign identical values to every token within a sequence, failing to capture token-specific contributions and hindering effective learning. To address this limitation, we propose Key-token Advantage Estimation (KTAE), a novel algorithm that estimates fine-grained, token-level advantages without introducing additional models. KTAE leverages the correctness of sampled rollouts and applies statistical analysis to quantify the importance of individual tokens…
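To make the granularity issue concrete, here is a minimal sketch of a GRPO-style rollout-level advantage. It is not the authors' code and does not implement KTAE, whose details are elided in the abstract above. The function name `rollout_level_advantages`, the binary rewards, the token counts, and the choice of population standard deviation with a small `eps` are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def rollout_level_advantages(rewards, token_counts, eps=1e-6):
    """Group-normalized advantage, copied to every token of each rollout.

    rewards:      one scalar reward per sampled rollout (e.g. 1.0 if the
                  final answer is correct, 0.0 otherwise; an assumption).
    token_counts: number of generated tokens in each rollout.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; implementations may differ
    per_rollout = [(r - mu) / (sigma + eps) for r in rewards]
    # Every token in a rollout receives the same value: the coarse
    # granularity the abstract describes.
    return [[a] * n for a, n in zip(per_rollout, token_counts)]


# Hypothetical group of four rollouts for one prompt.
rewards = [1.0, 0.0, 0.0, 1.0]
token_counts = [3, 5, 2, 4]
advantages = rollout_level_advantages(rewards, token_counts)

for r, adv in zip(rewards, advantages):
    print(r, adv)
```

In this sketch, a token that derails an incorrect rollout and a harmless filler token in the same rollout receive the same negative advantage. A token-level estimator would instead assign each position its own value.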
Taxonomy
Topics: Reinforcement Learning in Robotics · Topic Modeling · Natural Language Processing Techniques
Methods: Dialogue-Adaptive Pre-training Objective · Balanced Selection
