OPPO: Bayesian Value Recursion for Token-Level Credit Assignment in LLM Reasoning

Yu Li; Rui Miao; Tian Lan; Zhengling Qi

arXiv:2605.21851 · cs.LG · May 22, 2026

TL;DR

OPPO takes a Bayesian approach to token-level credit assignment in LLM reasoning: it accumulates per-token oracle signals along a trajectory into an estimate of the probability of eventual success, at the cost of one extra forward pass.

Contribution

The paper proposes OPPO (Oracle-Prompted Policy Optimization), a Bayesian framework for token-level credit assignment in LLMs. It accumulates oracle signals along each trajectory instead of applying every signal in isolation, and it outperforms prior methods.

Findings

1. OPPO outperforms GRPO, DAPO, and SDPO on multiple benchmarks.
2. OPPO achieves up to +6.0 points on AMC'23 and +5.2 points on AIME'24.
3. Gains from OPPO increase with longer responses.

Abstract

Reinforcement learning with verifiable rewards has become the standard recipe for improving LLM reasoning, but the dominant algorithm GRPO assigns a single trajectory-level advantage to every token, diluting the signal at pivotal reasoning steps and injecting noise at uninformative ones. Critic-free alternatives derived from on-policy distillation supply per-token signals through oracle-conditioned likelihood ratios, yet apply each signal in isolation from the trajectory-level evidence accumulated up to that position. We propose Oracle-Prompted Policy Optimization (OPPO), which rests on a single observation: the oracle signal used by prior distillation-style methods for local discrimination is also the natural Bayesian update of the model's belief about eventual success. Accumulating the signal along a trajectory yields, in closed form and at the cost of one extra forward pass, a…
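The contrast drawn in the abstract can be illustrated with a minimal sketch; it is not the paper's implementation. The first function shows the GRPO-style behavior described above: one group-normalized, trajectory-level advantage copied to every token. The second is a hypothetical reading of "accumulating the signal along a trajectory", based only on Bayes' rule: P(S | y_1..t) = P(S | y_1..t-1) * P(y_t | S, y_<t) / P(y_t | y_<t). It assumes the oracle-conditioned token likelihood stands in for P(y_t | S, y_<t) and the ordinary policy likelihood for P(y_t | y_<t). The function names, the prior, and the clamping are illustrative assumptions; the paper's actual closed form is not shown in the truncated abstract.

```python
import math
from statistics import mean, pstdev


def grpo_token_advantages(rewards, lengths, eps=1e-6):
    """GRPO-style credit: one group-normalized advantage per trajectory,
    repeated unchanged for every token of that trajectory."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [[(r - mu) / (sigma + eps)] * n for r, n in zip(rewards, lengths)]


def running_success_belief(prior, logp_oracle, logp_policy):
    """Hypothetical sketch (assumption, not the paper's formula):
    update a belief about eventual success token by token via Bayes' rule,
    multiplying by the ratio of oracle-conditioned to plain token likelihood."""
    log_belief = math.log(prior)
    beliefs = []
    for lo, lp in zip(logp_oracle, logp_policy):
        log_belief += lo - lp  # log of the per-token likelihood ratio
        # Clamp the reported value: approximate likelihoods can push it past 1.
        beliefs.append(min(1.0, math.exp(log_belief)))
    return beliefs


if __name__ == "__main__":
    # Four sampled responses with binary verifier rewards and token counts.
    adv = grpo_token_advantages([1, 0, 0, 1], [2, 3, 1, 2])
    print(adv[1])  # every token of the failed response shares one value

    # Made-up per-token probabilities, with and without the oracle prompt.
    oracle = [math.log(0.6), math.log(0.9)]
    policy = [math.log(0.5), math.log(0.6)]
    print(running_success_belief(0.5, oracle, policy))
```

Under these assumptions the belief rises at tokens the oracle-conditioned model finds more likely than the plain policy does, and falls at tokens it finds less likely, which gives a per-position signal where the uniform advantage gives none.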
