DC-W2S: Dual-Consensus Weak-to-Strong Training for Reliable Process Reward Modeling in Biological Reasoning

Chi-Min Chan; Ehsan Hajiramezanali; Xiner Li; Edward De Brouwer; Carl Edwards; Wei Xue; Sirui Han; Yike Guo; Gabriele Scalia

arXiv:2603.08095·cs.CL·March 10, 2026

Open Access

TL;DR

This paper introduces DC-W2S, a framework that improves the training of Process Reward Models in biological reasoning by selecting high-quality supervision signals from noisy data through dual consensus metrics, reducing reliance on expert annotations.

Contribution

The paper proposes a novel Dual-Consensus Weak-to-Strong framework that stratifies supervision signals using consensus metrics, enhancing the reliability of training PRMs with noisy supervision.

Findings

1. DC-W2S improves PRM robustness in biological reasoning tasks.
2. Strategic data curation outperforms large-scale noisy data training.
3. The framework reduces dependence on costly expert annotations.

Abstract

In scientific reasoning tasks, the veracity of the reasoning process is as critical as the final outcome. While Process Reward Models (PRMs) offer a solution to the coarse-grained supervision problems inherent in Outcome Reward Models (ORMs), their deployment is hindered by the prohibitive cost of obtaining expert-verified step-wise labels. This paper addresses the challenge of training reliable PRMs using abundant but noisy "weak" supervision. We argue that existing Weak-to-Strong Generalization (W2SG) theories lack prescriptive guidelines for selecting high-quality training signals from noisy data. To bridge this gap, we introduce the Dual-Consensus Weak-to-Strong (DC-W2S) framework. By intersecting Self-Consensus (SC) metrics among weak supervisors with Neighborhood-Consensus (NC) metrics in the embedding space, we stratify supervision signals into distinct reliability regimes. We…
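
To make the idea of intersecting the two consensus signals concrete, here is a minimal Python sketch. It is not the authors' implementation: the abstract does not define the SC and NC metrics, so this sketch assumes SC is the fraction of weak supervisors agreeing with the majority label for a step, and NC is the fraction of a step's k nearest neighbours (Euclidean distance in embedding space) that share its majority label. The function names, the thresholds `tau_sc` and `tau_nc`, and the regime names are all hypothetical.

```python
from collections import Counter
import math


def majority_and_agreement(votes):
    """Return (majority label, fraction of weak supervisors voting for it).

    Assumed stand-in for a Self-Consensus (SC) score.
    """
    label, count = Counter(votes).most_common(1)[0]
    return label, count / len(votes)


def neighborhood_consensus(embeddings, labels, k):
    """Assumed stand-in for a Neighborhood-Consensus (NC) score.

    For each item, the fraction of its k nearest neighbours
    (Euclidean distance) that carry the same label as the item.
    """
    scores = []
    for i, e in enumerate(embeddings):
        dists = sorted(
            (math.dist(e, other), j)
            for j, other in enumerate(embeddings)
            if j != i
        )
        neighbours = [j for _, j in dists[:k]]
        same = sum(labels[j] == labels[i] for j in neighbours)
        scores.append(same / len(neighbours))
    return scores


def stratify(votes_per_step, embeddings, k=2, tau_sc=0.8, tau_nc=0.5):
    """Assign each step to one of four regimes by thresholding SC and NC.

    The thresholds are illustrative, not values from the paper.
    """
    pairs = [majority_and_agreement(v) for v in votes_per_step]
    labels = [label for label, _ in pairs]
    sc = [agreement for _, agreement in pairs]
    nc = neighborhood_consensus(embeddings, labels, k)
    regimes = [
        ("high" if s >= tau_sc else "low") + "-SC/"
        + ("high" if n >= tau_nc else "low") + "-NC"
        for s, n in zip(sc, nc)
    ]
    return labels, regimes


# Toy data: four reasoning steps, three weak supervisors each,
# with made-up 2-D embeddings.
votes = [[1, 1, 1], [1, 1, 0], [0, 0, 0], [0, 1, 0]]
embeddings = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (0.0, 0.1)]
labels, regimes = stratify(votes, embeddings)
```

Under these assumptions, a step whose weak labels disagree and whose embedding neighbours carry a different label falls in the low-SC/low-NC regime, which a curation step might down-weight or discard, while high-SC/high-NC steps would be the most trusted training signal.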

Taxonomy

Topics: Explainable Artificial Intelligence (XAI) · Machine Learning in Healthcare · Machine Learning and Data Classification