Owen-Shapley Policy Optimization: A Principled RL Algorithm for Generative Search LLMs
Abhijnan Nath, Alireza Bagheri Garakani, Tianchen Zhou, Fan Yang, Yan Gao, Nikhil Krishnaswamy

TL;DR
OSPO is a reinforcement learning framework for large language models that improves credit assignment by redistributing sequence-level rewards to the tokens and segments that contributed to them, leading to better performance and robustness in generative search tasks.
Contribution
Introduces Owen-Shapley Policy Optimization (OSPO), a novel method that redistributes sequence-level rewards based on token contributions without requiring parametric value models.
Findings
OSPO achieves consistent performance gains over baselines.
OSPO demonstrates improved robustness to out-of-distribution retrievers.
OSPO effectively identifies influential response segments in generative tasks.
Abstract
Large language models are increasingly trained via reinforcement learning for personalized recommendation tasks, but standard methods like GRPO rely on sparse, sequence-level rewards. These obscure which tokens actually contribute to high-quality outputs, creating a credit assignment gap. This gap is especially problematic when models must infer latent user intent from under-specified language without ground truth labels, which is a reasoning pattern rarely seen during pretraining but commonly required in deployment. We introduce Owen-Shapley Policy Optimization (OSPO), a framework that redistributes sequence-level advantages based on tokens' marginal contributions to outcomes. OSPO transforms task feedback into potential-based reward shaping via Shapley-Owen attributions to assign segment-level credit while preserving the optimal policy, all without parametric value models. By forming…
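To make the credit-assignment idea concrete, here is a minimal sketch of splitting a single sequence-level advantage across response segments using exact Shapley values. It is an illustration under stated assumptions, not the paper's implementation: it uses plain Shapley values rather than the Owen (coalition-structured) variant, and `toy_reward`, `shapley_values`, and `redistribute_advantage` are hypothetical names. The proportional split is one simple choice; it ignores the potential-based shaping the abstract describes and would need care when attributions are negative.

```python
from itertools import combinations
from math import factorial


def shapley_values(n_segments, value_fn):
    """Exact Shapley value of each segment under value_fn(frozenset of indices)."""
    phi = [0.0] * n_segments
    for i in range(n_segments):
        others = [j for j in range(n_segments) if j != i]
        for k in range(len(others) + 1):
            # Probability weight of a coalition of size k that excludes segment i.
            weight = factorial(k) * factorial(n_segments - k - 1) / factorial(n_segments)
            for subset in combinations(others, k):
                s = frozenset(subset)
                # Marginal contribution of segment i when added to coalition s.
                phi[i] += weight * (value_fn(s | {i}) - value_fn(s))
    return phi


def redistribute_advantage(advantage, phi):
    """Split one sequence-level advantage across segments in proportion to phi."""
    total = sum(phi)
    if total == 0:
        # No attribution signal: fall back to a uniform split.
        return [advantage / len(phi)] * len(phi)
    return [advantage * p / total for p in phi]


def toy_reward(coalition):
    """Hypothetical reward: segment 0 helps alone; segments 1 and 2 only help together."""
    score = 0.0
    if 0 in coalition:
        score += 2.0
    if 1 in coalition and 2 in coalition:
        score += 2.0
    return score


phi = shapley_values(3, toy_reward)
segment_adv = redistribute_advantage(0.8, phi)
```

In this toy setting, segment 0 receives half of the total attribution and segments 1 and 2 share the other half equally, and the per-segment advantages still sum to the original sequence-level advantage. Exact enumeration is exponential in the number of segments, so any practical use would rely on coarse segments or sampling.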
