OPPO: Bayesian Value Recursion for Token-Level Credit Assignment in LLM Reasoning
Yu Li, Rui Miao, Tian Lan, Zhengling Qi

TL;DR
OPPO is a Bayesian approach to token-level credit assignment in LLM reasoning. It accumulates oracle-conditioned signals along each trajectory into a per-token estimate of the probability of eventual success, in closed form and at the cost of one extra forward pass.
Contribution
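The paper proposes Oracle-Prompted Policy Optimization (OPPO), a Bayesian framework for token-level credit assignment in LLMs. It treats the oracle signal used by prior distillation-style methods as a Bayesian update of the model's belief about eventual success, and accumulates that signal along the trajectory instead of applying it at each token in isolation.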
Findings
OPPO outperforms GRPO, DAPO, and SDPO on multiple benchmarks.
OPPO gains up to 6.0 points on AMC'23 and up to 5.2 points on AIME'24.
Gains from OPPO increase with longer responses.
Abstract
Reinforcement learning with verifiable rewards has become the standard recipe for improving LLM reasoning, but the dominant algorithm GRPO assigns a single trajectory-level advantage to every token, diluting the signal at pivotal reasoning steps and injecting noise at uninformative ones. Critic-free alternatives derived from on-policy distillation supply per-token signals through oracle-conditioned likelihood ratios, yet apply each signal in isolation from the trajectory-level evidence accumulated up to that position. We propose Oracle-Prompted Policy Optimization (OPPO), which rests on a single observation: the oracle signal used by prior distillation-style methods for local discrimination is also the natural Bayesian update of the model's belief about eventual success. Accumulating the signal along a trajectory yields, in closed form and at the cost of one extra forward pass, a…
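The Bayesian reading in the abstract can be illustrated with a minimal sketch. It is not the paper's algorithm: it only applies Bayes' rule under the assumption that the oracle-prompted model approximates the token distribution conditioned on eventual success, so that the per-token likelihood ratio multiplies the current belief. The function names, the `prior` argument, and the use of belief differences as per-token credit are all hypothetical choices made for illustration.

```python
import math

def success_beliefs(prior, logp_oracle, logp_policy):
    """Belief in eventual success after each token (illustrative only).

    prior       : assumed P(success | prompt), e.g. an empirical pass rate
    logp_oracle : per-token log-probs under the oracle-prompted model (assumed input)
    logp_policy : per-token log-probs under the plain policy (assumed input)
    """
    beliefs = []
    log_belief = math.log(prior)
    for lo, lp in zip(logp_oracle, logp_policy):
        # Bayes' rule in log space: posterior = prior * likelihood ratio.
        log_belief += lo - lp
        # Clamp at 1.0: an approximate ratio can push the product past a valid probability.
        beliefs.append(min(1.0, math.exp(log_belief)))
    return beliefs

def token_credits(prior, beliefs):
    """Hypothetical per-token credit: the change in belief caused by each token."""
    previous = [prior] + beliefs[:-1]
    return [b - p for b, p in zip(beliefs, previous)]

# Toy trajectory: the first token is favored by the oracle, the second is not.
oracle = [math.log(0.6), math.log(0.2)]
policy = [math.log(0.5), math.log(0.4)]
beliefs = success_beliefs(0.5, oracle, policy)
credits = token_credits(0.5, beliefs)
```

In this toy run the belief rises after the first token and falls after the second, so the two tokens receive credit of opposite sign even though a trajectory-level method would assign both the same advantage.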
