Segment-Aligned Policy Optimization for Multi-Modal Reasoning

Lei Gao; Zhuoming Li; Mengxi Jia; Jiakang Yuan; Hongbo Sun; Hao Sun; Xuelong Li

arXiv:2605.01327 · cs.AI · May 8, 2026


TL;DR

This paper introduces Segment-Aligned Policy Optimization (SAPO), a reinforcement learning method that aligns policy updates with reasoning segments in multi-modal tasks, improving accuracy and training stability.

Contribution

SAPO is an RL paradigm that treats reasoning steps as the fundamental units of policy update, aligning optimization with the structure of reasoning and outperforming token-level and sequence-level methods.

Findings

01

SAPO outperforms token-level and sequence-level methods on reasoning benchmarks.

02

SAPO achieves higher accuracy and better training stability.

03

Code and models will be released for reproducibility.

Abstract

Existing reinforcement learning approaches for Large Language Models typically perform policy optimization at the granularity of individual tokens or entire response sequences. However, such formulations often misalign with the natural step-wise structure of reasoning processes, leading to suboptimal credit assignment and unstable training in multi-modal reasoning tasks. To bridge this gap, we propose Segment-Aligned Policy Optimization (SAPO), a novel reinforcement learning paradigm that treats coherent reasoning steps, rather than tokens or full sequences, as the fundamental units of policy update. SAPO introduces a step-wise Markov decision process abstraction over reasoning segments, accompanied by segment-level value estimation, advantage computation, and importance sampling mechanisms that are semantically aligned with reasoning boundaries. Experiments on representative reasoning…
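
The abstract names segment-level importance sampling and advantage computation but gives no formulas here, so the following is only an illustrative sketch of what "segment as the unit of update" could look like, not the authors' implementation. Every name in it (`segment_ratios`, `segment_advantages`, `clipped_objective`) is hypothetical, and three choices are assumptions: the segment ratio is the exponential of the mean token log-ratio within the segment, the advantage is a one-step temporal-difference estimate over segment values, and the objective is a standard PPO-style clipped surrogate.

```python
import math
from typing import List, Tuple

# A segment is a half-open token span [start, end) covering one reasoning step.
Segment = Tuple[int, int]


def segment_ratios(new_logp: List[float], old_logp: List[float],
                   segments: List[Segment]) -> List[float]:
    """One importance ratio per segment.

    Assumption: length-normalized, i.e. exp(mean token log-ratio in the segment).
    """
    ratios = []
    for start, end in segments:
        diff = sum(n - o for n, o in zip(new_logp[start:end], old_logp[start:end]))
        ratios.append(math.exp(diff / (end - start)))
    return ratios


def segment_advantages(rewards: List[float], values: List[float],
                       gamma: float = 1.0) -> List[float]:
    """One-step TD advantage per segment: r_k + gamma * V_{k+1} - V_k.

    Assumption: the value after the final segment is 0 (episode ends).
    """
    advantages = []
    for k, reward in enumerate(rewards):
        next_value = values[k + 1] if k + 1 < len(values) else 0.0
        advantages.append(reward + gamma * next_value - values[k])
    return advantages


def clipped_objective(ratios: List[float], advantages: List[float],
                      eps: float = 0.2) -> float:
    """PPO-style clipped surrogate, averaged over segments (to be maximized)."""
    total = 0.0
    for rho, adv in zip(ratios, advantages):
        clipped = min(max(rho, 1.0 - eps), 1.0 + eps)
        total += min(rho * adv, clipped * adv)
    return total / len(ratios)


if __name__ == "__main__":
    # Toy rollout: six tokens split into three reasoning segments.
    old = [-1.0, -1.2, -0.8, -0.5, -0.9, -1.1]
    new = [-0.9, -1.1, -0.8, -0.6, -0.8, -1.0]
    segs = [(0, 2), (2, 4), (4, 6)]
    rho = segment_ratios(new, old, segs)
    adv = segment_advantages(rewards=[0.0, 0.0, 1.0], values=[0.5, 0.6, 0.8])
    print(clipped_objective(rho, adv))
```

In this sketch every token in a segment shares one ratio and one advantage, which is the sense in which the update unit sits between the token level and the full-sequence level.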
