P2S: Probabilistic Process Supervision for General-Domain Reasoning Question Answering

Wenlin Zhong; Chengyuan Liu; Yiquan Wu; Bovin Tan; Changlong Sun; Yi Wang; Xiaozhong Liu; Kun Kuang

arXiv:2601.20649 · cs.CL · January 29, 2026

TL;DR

This paper introduces Probabilistic Process Supervision (P2S), a self-supervised reinforcement learning framework that enhances reasoning in large language models by providing step-by-step process rewards without human annotations.

Contribution

P2S offers a novel method for fine-grained, process-level supervision in RL for reasoning tasks, improving upon outcome-only reward approaches without needing extra annotations.

Findings

01. P2S significantly improves performance on reading comprehension benchmarks.

02. P2S outperforms strong baseline models in medical question answering.

03. The method effectively addresses reward sparsity in reinforcement learning for reasoning.

Abstract

While reinforcement learning with verifiable rewards (RLVR) has advanced LLM reasoning in structured domains like mathematics and programming, its application to general-domain reasoning tasks remains challenging due to the absence of verifiable reward signals. To this end, methods like Reinforcement Learning with Reference Probability Reward (RLPR) have emerged, leveraging the probability of generating the final answer as a reward signal. However, these outcome-focused approaches neglect crucial step-by-step supervision of the reasoning process itself. To address this gap, we introduce Probabilistic Process Supervision (P2S), a novel self-supervision framework that provides fine-grained process rewards without requiring a separate reward model or human-annotated reasoning steps. During reinforcement learning, P2S synthesizes and filters a high-quality reference reasoning chain…
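The outcome-level idea mentioned in the abstract, using the probability of generating the final answer as a reward, can be illustrated with a minimal sketch. This is not the RLPR or P2S implementation: the function name `answer_probability_reward`, the choice of averaging per-token probabilities, and the toy log-probabilities are all assumptions made for illustration. In practice the log-probabilities would come from the policy model scoring the reference answer after a sampled reasoning chain.

```python
import math
from typing import Sequence


def answer_probability_reward(answer_token_logprobs: Sequence[float]) -> float:
    """Hypothetical outcome reward: mean per-token probability of the reference answer.

    `answer_token_logprobs` holds the log-probability the model assigns to each
    token of the reference answer, conditioned on the question and a sampled
    reasoning chain. Averaging probabilities is an assumed aggregation choice.
    """
    if not answer_token_logprobs:
        raise ValueError("need at least one answer token")
    probs = [math.exp(lp) for lp in answer_token_logprobs]
    return sum(probs) / len(probs)


# Toy values (made up): a reasoning chain that makes the reference answer likely
# versus one that makes it unlikely.
good_chain_logprobs = [math.log(0.9), math.log(0.8)]
bad_chain_logprobs = [math.log(0.2), math.log(0.1)]

reward_good = answer_probability_reward(good_chain_logprobs)
reward_bad = answer_probability_reward(bad_chain_logprobs)
```

A reward of this kind scores only the end of the trajectory, which is the limitation the abstract points to: every reasoning step in a chain receives the same scalar signal, regardless of which step helped or hurt.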

Taxonomy

Topics: Topic Modeling · Multimodal Machine Learning Applications · Explainable Artificial Intelligence (XAI)