SAPO: Step-Aligned Policy Optimization for Reasoning-Based Generative Recommendation
Zaiyi Zheng, Guanghui Min, Yaochen Zhu, Liang Wu, Liangjie Hong, Chen Chen, Jundong Li

TL;DR
SAPO introduces a step-aligned policy optimization method that improves reasoning-based generative recommendation by assigning credit to individual reasoning steps, leading to more stable training and better performance.
Contribution
The paper proposes SAPO, a novel reinforcement learning approach that aligns advantage computation with reasoning steps in generative recommendation models, enhancing training stability and accuracy.
Findings
SAPO stabilizes reinforcement learning training in recommendation tasks.
SAPO outperforms existing baselines across three real-world datasets.
Credit assignment at the reasoning step level improves model performance.
Abstract
Generative recommendation treats next-item prediction as autoregressive item-identifier generation. Specifically, items are encoded as semantic identifiers (SIDs): short coarse-to-fine token sequences whose early tokens capture broad semantics and whose later tokens refine them. Recent work augments this paradigm with reasoning traces and optimizes them via reinforcement learning with verifiable rewards, typically using an outcome-reward algorithm with exact-match feedback on the generated SID. However, in large-catalog recommendation, exact-match feedback only reports whether the final item is correct; when a generated SID mismatches, an outcome reward cannot identify which SID-token prediction caused the mismatch and may penalize the matched SID-token positions together with the mismatched one. We identify that the natural unit of credit assignment in this setting is a…
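The credit-assignment problem described in the abstract can be made concrete with a small sketch. This is not SAPO's algorithm, which the truncated abstract does not fully specify; it only contrasts a sequence-level exact-match reward with a per-position comparison. The SID values and the function names (`exact_match_reward`, `position_matches`, `first_mismatch`) are hypothetical and chosen purely for illustration.

```python
from typing import List, Sequence


def exact_match_reward(pred: Sequence[int], target: Sequence[int]) -> float:
    """Outcome-style reward: 1.0 only if the whole SID matches, else 0.0."""
    return 1.0 if tuple(pred) == tuple(target) else 0.0


def position_matches(pred: Sequence[int], target: Sequence[int]) -> List[bool]:
    """Per-position comparison of SID tokens (illustrative, not SAPO itself)."""
    return [p == t for p, t in zip(pred, target)]


def first_mismatch(pred: Sequence[int], target: Sequence[int]) -> int:
    """Index of the first mismatched SID token, or -1 if all positions match."""
    for i, (p, t) in enumerate(zip(pred, target)):
        if p != t:
            return i
    return -1


# Hypothetical 4-token SIDs: coarse tokens first, finer tokens later.
target_sid = (12, 7, 3, 41)
pred_sid = (12, 7, 9, 41)

reward = exact_match_reward(pred_sid, target_sid)
matches = position_matches(pred_sid, target_sid)
mismatch_index = first_mismatch(pred_sid, target_sid)

print(reward)          # single scalar shared by every token in the sequence
print(matches)         # which positions agreed with the target
print(mismatch_index)  # where the prediction first diverged
```

In this toy case the scalar reward is zero even though the two coarsest tokens are correct, so a purely outcome-based update would treat those positions the same as the one that actually diverged. The per-position view exposes the information that a finer-grained credit assignment could use.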
