Segment-Aligned Policy Optimization for Multi-Modal Reasoning
Lei Gao, Zhuoming Li, Mengxi Jia, Jiakang Yuan, Hongbo Sun, Hao Sun, Xuelong Li

TL;DR
This paper introduces Segment-Aligned Policy Optimization (SAPO), a reinforcement learning method that aligns policy updates with reasoning segments in multi-modal tasks, improving accuracy and training stability.
Contribution
SAPO is a novel RL paradigm that treats coherent reasoning steps as the fundamental units of policy update, aligning optimization with the structure of the reasoning process and outperforming traditional token-level and sequence-level methods.
Findings
SAPO outperforms token-level and sequence-level methods on reasoning benchmarks.
SAPO achieves higher accuracy and better training stability.
Code and models will be released for reproducibility.
Abstract
Existing reinforcement learning approaches for Large Language Models typically perform policy optimization at the granularity of individual tokens or entire response sequences. However, such formulations often misalign with the natural step-wise structure of reasoning processes, leading to suboptimal credit assignment and unstable training in multi-modal reasoning tasks. To bridge this gap, we propose Segment-Aligned Policy Optimization (SAPO), a novel reinforcement learning paradigm that treats coherent reasoning steps, rather than tokens or full sequences, as the fundamental units of policy update. SAPO introduces a step-wise Markov decision process abstraction over reasoning segments, accompanied by segment-level value estimation, advantage computation, and importance sampling mechanisms that are semantically aligned with reasoning boundaries. Experiments on representative reasoning…
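The abstract names three segment-level components (value estimation, advantage computation, importance sampling) without giving their formulas. The sketch below is one plausible reading, not the authors' implementation. It assumes that a segment's importance ratio is the exponential of the summed token log-probability differences inside that segment, that advantages come from a one-step TD residual over per-segment values, and that the update uses a PPO-style clipped surrogate averaged over segments. All function names, the `seg_ends` boundary encoding, and the `gamma` and `eps` parameters are hypothetical.

```python
import math
from typing import List, Sequence


def segment_ratios(new_logps: Sequence[float],
                   old_logps: Sequence[float],
                   seg_ends: Sequence[int]) -> List[float]:
    """One importance ratio per segment (assumed form, not from the paper).

    seg_ends holds the exclusive end index of each segment, in order.
    The ratio is exp(sum of new - old token log-probs within the segment).
    """
    ratios, start = [], 0
    for end in seg_ends:
        diff = sum(n - o for n, o in zip(new_logps[start:end],
                                         old_logps[start:end]))
        ratios.append(math.exp(diff))
        start = end
    return ratios


def segment_advantages(rewards: Sequence[float],
                       values: Sequence[float],
                       gamma: float = 1.0) -> List[float]:
    """One-step TD advantage per segment (assumed estimator).

    The value after the final segment is taken to be 0.
    """
    advs = []
    for k, r in enumerate(rewards):
        next_v = values[k + 1] if k + 1 < len(values) else 0.0
        advs.append(r + gamma * next_v - values[k])
    return advs


def clipped_segment_objective(ratios: Sequence[float],
                              advs: Sequence[float],
                              eps: float = 0.2) -> float:
    """PPO-style clipped surrogate, averaged over segments instead of tokens."""
    terms = []
    for rho, a in zip(ratios, advs):
        clipped = min(max(rho, 1.0 - eps), 1.0 + eps)
        terms.append(min(rho * a, clipped * a))
    return sum(terms) / len(terms)


if __name__ == "__main__":
    # Toy response: three tokens split into two reasoning segments.
    new_lp = [-1.0, -1.0, -2.0]
    old_lp = [-1.0, -1.5, -2.0]
    rho = segment_ratios(new_lp, old_lp, seg_ends=[2, 3])
    adv = segment_advantages(rewards=[0.0, 1.0], values=[0.5, 0.8])
    print(clipped_segment_objective(rho, adv))
```

Because every token in a segment shares one ratio and one advantage, clipping acts on whole reasoning steps, which is the granularity the abstract argues for. A length-normalized ratio (dividing the summed difference by the segment length before exponentiating) would be an equally reasonable assumption and may behave better on long segments.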
