Beyond Token-Level Policy Gradients for Complex Reasoning with Large Language Models

Mufan Xu; Kehai Chen; Xuefeng Bai; Zhengyu Niu; Muyun Yang; Tiejun Zhao; Min Zhang

arXiv:2602.14386·cs.CL·February 17, 2026

Open Access

TL;DR

This paper introduces Multi-token Policy Gradient Optimization (MPO), a framework that treats blocks of K consecutive tokens as unified actions in order to better capture the structure of complex reasoning in language models. The authors report that it outperforms token-level methods.

Contribution

The paper proposes MPO, a novel block-level policy gradient method that enhances reasoning capabilities of language models by optimizing over multi-token actions, addressing limitations of token-level approaches.

Findings

01. MPO outperforms standard token-level policy gradients on reasoning benchmarks.

02. Token-level policy gradients have limitations for complex reasoning tasks.

03. Block-level optimization better captures the structure of reasoning in language models.

Abstract

Existing policy-gradient methods for auto-regressive language models typically select subsequent tokens one at a time as actions in the policy. While effective for many generation tasks, such an approach may not fully capture the structure of complex reasoning tasks, where a single semantic decision is often realized across multiple tokens (for example, when defining variables or composing equations). This introduces a potential mismatch between token-level optimization and the inherently block-level nature of reasoning in these settings. To bridge this gap, we propose Multi-token Policy Gradient Optimization (MPO), a framework that treats sequences of K consecutive tokens as unified semantic actions. This block-level perspective enables our method to capture the compositional structure of reasoning trajectories and supports optimization over coherent, higher-level objectives.…
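
The abstract's idea of treating K consecutive tokens as one action can be illustrated with a small sketch. It is not the paper's objective, which this summary does not state. It assumes, for illustration only, that a block's log-probability is the sum of its token log-probabilities and that a PPO-style clipped ratio is applied per block rather than per token. The function names, the single sequence-level `advantage`, and the `eps` clipping parameter are all hypothetical.

```python
import math
from typing import List


def block_sums(token_log_probs: List[float], k: int) -> List[float]:
    """Sum per-token log-probs over consecutive blocks of k tokens.

    The final block may be shorter when the length is not a multiple of k.
    """
    if k < 1:
        raise ValueError("k must be >= 1")
    return [sum(token_log_probs[i:i + k])
            for i in range(0, len(token_log_probs), k)]


def block_ratios(new_lp: List[float], old_lp: List[float], k: int) -> List[float]:
    """Importance ratio pi_new(block) / pi_old(block) for each block."""
    if len(new_lp) != len(old_lp):
        raise ValueError("log-prob sequences must have equal length")
    return [math.exp(n - o)
            for n, o in zip(block_sums(new_lp, k), block_sums(old_lp, k))]


def clipped_block_objective(new_lp: List[float], old_lp: List[float],
                            advantage: float, k: int, eps: float = 0.2) -> float:
    """Mean PPO-style clipped surrogate, with clipping applied per block.

    Assumption: one scalar advantage is shared by every block in the sequence.
    """
    terms = []
    for r in block_ratios(new_lp, old_lp, k):
        clipped = max(1.0 - eps, min(r, 1.0 + eps))
        terms.append(min(r * advantage, clipped * advantage))
    return sum(terms) / len(terms)
```

With `k=1` this sketch reduces to an ordinary token-level clipped surrogate. Larger `k` lets per-token deviations within a block offset one another before clipping is applied.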

Taxonomy

Topics: Topic Modeling · Multimodal Machine Learning Applications · Artificial Intelligence in Healthcare and Education