P2S: Probabilistic Process Supervision for General-Domain Reasoning Question Answering
Wenlin Zhong, Chengyuan Liu, Yiquan Wu, Bovin Tan, Changlong Sun, Yi Wang, Xiaozhong Liu, Kun Kuang

TL;DR
This paper introduces Probabilistic Process Supervision (P2S), a self-supervised reinforcement learning framework that enhances reasoning in large language models by providing step-by-step process rewards without human annotations.
Contribution
P2S offers a novel method for fine-grained, process-level supervision in RL for reasoning tasks, improving upon outcome-only reward approaches without needing extra annotations.
Findings
P2S significantly improves performance on reading comprehension benchmarks.
P2S outperforms strong baseline models in medical question answering.
The method effectively addresses reward sparsity in reinforcement learning for reasoning.
Abstract
While reinforcement learning with verifiable rewards (RLVR) has advanced LLM reasoning in structured domains like mathematics and programming, its application to general-domain reasoning tasks remains challenging due to the absence of verifiable reward signals. To this end, methods like Reinforcement Learning with Reference Probability Reward (RLPR) have emerged, leveraging the probability of generating the final answer as a reward signal. However, these outcome-focused approaches neglect crucial step-by-step supervision of the reasoning process itself. To address this gap, we introduce Probabilistic Process Supervision (P2S), a novel self-supervision framework that provides fine-grained process rewards without requiring a separate reward model or human-annotated reasoning steps. During reinforcement learning, P2S synthesizes and filters a high-quality reference reasoning chain…
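As a rough illustration of the outcome-level idea the abstract attributes to RLPR, the sketch below scores a sampled reasoning chain by how probable the reference answer becomes after it. The function name, the mean-of-token-probabilities aggregation, and the numbers are assumptions made for this example only. The abstract does not specify them, and this is not the P2S process reward, whose details are cut off above.

```python
import math
from typing import Sequence


def answer_probability_reward(token_logprobs: Sequence[float]) -> float:
    """Mean per-token probability of the reference answer tokens.

    token_logprobs: log-probabilities the policy assigns to each token of the
    reference answer, conditioned on the question and a sampled reasoning chain.
    Averaging the probabilities is an assumed aggregation, chosen for illustration.
    """
    if not token_logprobs:
        raise ValueError("need at least one answer token")
    return sum(math.exp(lp) for lp in token_logprobs) / len(token_logprobs)


# Hypothetical log-probabilities of the same two-token reference answer
# after two different sampled reasoning chains.
good_chain = [math.log(0.9), math.log(0.8)]
weak_chain = [math.log(0.3), math.log(0.2)]

good_reward = answer_probability_reward(good_chain)
weak_reward = answer_probability_reward(weak_chain)
```

A reward of this kind is computed once per sampled chain, so every step in the chain shares the same signal. That is the outcome-only limitation the abstract describes.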
Taxonomy
Topics: Topic Modeling · Multimodal Machine Learning Applications · Explainable Artificial Intelligence (XAI)
