Beyond Token-Level Policy Gradients for Complex Reasoning with Large Language Models
Mufan Xu, Kehai Chen, Xuefeng Bai, Zhengyu Niu, Muyun Yang, Tiejun Zhao, Min Zhang

TL;DR
This paper introduces Multi-token Policy Gradient Optimization (MPO), a new framework that treats sequences of tokens as unified actions to better capture the structure of complex reasoning in language models, outperforming token-level methods.
Contribution
The paper proposes MPO, a novel block-level policy gradient method that enhances reasoning capabilities of language models by optimizing over multi-token actions, addressing limitations of token-level approaches.
Findings
MPO outperforms standard token-level policy gradients on reasoning benchmarks.
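Token-level policy gradients may not fully capture complex reasoning tasks, where a single semantic decision (such as defining a variable or composing an equation) often spans multiple tokens.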
Block-level optimization better captures the structure of reasoning in language models.
Abstract
Existing policy-gradient methods for auto-regressive language models typically select subsequent tokens one at a time as actions in the policy. While effective for many generation tasks, such an approach may not fully capture the structure of complex reasoning tasks, where a single semantic decision is often realized across multiple tokens--for example, when defining variables or composing equations. This introduces a potential mismatch between token-level optimization and the inherently block-level nature of reasoning in these settings. To bridge this gap, we propose Multi-token Policy Gradient Optimization (MPO), a framework that treats sequences of K consecutive tokens as unified semantic actions. This block-level perspective enables our method to capture the compositional structure of reasoning trajectories and supports optimization over coherent, higher-level objectives.…
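To make the block-level idea concrete, here is a minimal sketch, not the authors' implementation. It assumes that the log-probability of a K-token action is the sum of its tokens' log-probabilities (the autoregressive chain rule) and uses a plain REINFORCE-style surrogate loss. The abstract is truncated before it states the actual MPO objective, so the function names, the per-block advantages, and the handling of a shorter final block are all hypothetical.

```python
from typing import List


def block_log_probs(token_log_probs: List[float], k: int) -> List[float]:
    """Group per-token log-probs into blocks of k consecutive tokens.

    Each block's log-prob is the sum of its tokens' log-probs.
    Assumption: a trailing block shorter than k is kept as its own action.
    """
    if k <= 0:
        raise ValueError("k must be a positive integer")
    return [
        sum(token_log_probs[i:i + k])
        for i in range(0, len(token_log_probs), k)
    ]


def block_policy_gradient_loss(
    token_log_probs: List[float],
    block_advantages: List[float],
    k: int,
) -> float:
    """REINFORCE-style surrogate loss with one advantage per k-token block.

    Hypothetical illustration only: the loss is the negative mean of
    advantage * block log-prob, so minimizing it raises the probability
    of blocks with positive advantage.
    """
    blocks = block_log_probs(token_log_probs, k)
    if len(blocks) != len(block_advantages):
        raise ValueError("need exactly one advantage per block")
    weighted = [a * lp for a, lp in zip(block_advantages, blocks)]
    return -sum(weighted) / len(blocks)


if __name__ == "__main__":
    # Five sampled tokens grouped into blocks of K = 2 (last block has 1 token).
    log_probs = [-0.1, -0.2, -0.3, -0.4, -0.5]
    advantages = [1.0, 0.5, -1.0]  # made-up values, one per block
    print(block_log_probs(log_probs, 2))
    print(block_policy_gradient_loss(log_probs, advantages, 2))
```

With K = 1 the sketch reduces to the usual token-level surrogate, which is one way to see block-level optimization as a coarser choice of action granularity over the same trajectory.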
Taxonomy
Topics: Topic Modeling · Multimodal Machine Learning Applications · Artificial Intelligence in Healthcare and Education
