Correct Answers from Sound Reasoning: Verifiable Process Supervision for Language Models

Kyuyoung Kim; Kevin Wang; Yunfei Xie; Peiyang Xu; Peiyao Sheng; Chen Wei; Zhangyang Wang; Jinwoo Shin; Pramod Viswanath; Sewoong Oh

arXiv:2605.12519·cs.CL·May 14, 2026

TL;DR

This paper introduces verifiable process supervision (VPS), a post-training framework that improves the quality of a language model's reasoning without sacrificing answer accuracy. It is demonstrated on a chess domain, where verification is deterministic.

Contribution

VPS jointly optimizes prediction accuracy and reasoning quality by combining three components: supervised fine-tuning that induces a structured reasoning format, process-level rewards computed by checking intermediate claims against ground truth, and adaptive reward weighting.
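To make the idea of a process-level reward concrete, here is a minimal Python sketch. It assumes a hypothetical response format (one `name: value` claim per line inside a `<reasoning>` block, followed by an `<answer>` block) and a hypothetical additive combination of the accuracy and process terms. Neither the format, the claim names, nor the combination rule is taken from the paper; they are stand-ins for illustration only.

```python
import re

# Hypothetical structured format (an assumption, not the paper's):
#   <reasoning>
#   claim_name: value
#   ...
#   </reasoning>
#   <answer>final answer</answer>
BLOCK_RE = re.compile(
    r"<reasoning>(.*?)</reasoning>\s*<answer>(.*?)</answer>", re.DOTALL
)
CLAIM_RE = re.compile(r"^\s*([a-z_]+)\s*:\s*(.+?)\s*$", re.MULTILINE)


def parse_response(text):
    """Syntactically extract (claims, answer); return None if malformed."""
    match = BLOCK_RE.search(text)
    if match is None:
        return None
    claims = dict(CLAIM_RE.findall(match.group(1)))
    return claims, match.group(2).strip()


def process_rewards(claims, ground_truth):
    """Score each intermediate claim 1.0 if it matches ground truth, else 0.0."""
    return {
        name: 1.0 if claims.get(name) == expected else 0.0
        for name, expected in ground_truth.items()
    }


def total_reward(text, ground_truth, correct_answer, weights):
    """Accuracy reward plus a weighted sum of per-claim process rewards.

    The additive form is an illustrative assumption.
    """
    parsed = parse_response(text)
    if parsed is None:
        return 0.0  # unparseable output earns nothing in this sketch
    claims, answer = parsed
    accuracy = 1.0 if answer == correct_answer else 0.0
    per_claim = process_rewards(claims, ground_truth)
    process = sum(weights[name] * score for name, score in per_claim.items())
    return accuracy + process
```

Because every claim is checked by exact match against a ground-truth value, the reward needs no learned judge. That is the property that makes a deterministic domain such as chess convenient for this kind of supervision.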

Findings

01. VPS preserves accuracy while significantly improving reasoning quality.

02. Accuracy-only RL degrades reasoning, increasing errors and reducing consistency.

03. VPS outperforms traditional RL in verifiable reasoning tasks, restoring internal consistency.

Abstract

Training language models to produce both correct answers and sound reasoning remains an open challenge. Reinforcement learning with verifiable rewards typically optimizes only final outcomes, which can lead to a failure mode where task accuracy improves while reasoning becomes less accurate, less complete, or even internally inconsistent. We propose verifiable process supervision (VPS), a post-training framework for verifiable domains that jointly optimizes prediction accuracy and reasoning quality. We first apply supervised fine-tuning to induce a structured reasoning format, enabling syntactic extraction of intermediate claims that are evaluated against ground-truth signals to form process-level rewards. To address the heterogeneous difficulty of reasoning subtasks, we introduce adaptive reward weighting that prioritizes components with the largest remaining errors, creating an…
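The adaptive reward weighting described in the abstract can be illustrated with a small Python sketch. The abstract is truncated here and does not give the update rule, so everything below is an assumption: the class name `AdaptiveWeights`, the exponential moving average of per-component accuracy, the `momentum` and `floor` parameters, and the choice to set weights proportional to remaining error.

```python
class AdaptiveWeights:
    """Hypothetical tracker that gives more weight to components with more errors."""

    def __init__(self, components, momentum=0.9, floor=0.05):
        self.momentum = momentum  # smoothing factor for the running accuracy
        self.floor = floor        # minimum error, so mastered components keep some weight
        self.accuracy = {name: 0.0 for name in components}

    def update(self, batch_scores):
        """batch_scores maps each component to a list of 0/1 verifier scores."""
        for name, scores in batch_scores.items():
            batch_accuracy = sum(scores) / len(scores)
            self.accuracy[name] = (
                self.momentum * self.accuracy[name]
                + (1.0 - self.momentum) * batch_accuracy
            )

    def weights(self):
        """Return normalized weights proportional to each component's remaining error."""
        errors = {
            name: max(1.0 - acc, self.floor)
            for name, acc in self.accuracy.items()
        }
        total = sum(errors.values())
        return {name: err / total for name, err in errors.items()}
```

Under these assumptions, a component the model already gets right most of the time contributes little to the process reward, while a component that is still frequently wrong dominates it. The floor keeps a mastered component from dropping to zero weight, which is one plausible way to discourage regressions.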
