Adversarial Training for Process Reward Models

Gurusha Juneja; Deepak Nathani; William Yang Wang

arXiv:2511.22888·cs.LG·December 1, 2025

Open Access·3 Reviews

TL;DR

This paper introduces APRM, an adversarial training method for process reward models that enhances their robustness and generalization in reasoning tasks without manual annotations.

Contribution

The paper proposes APRM, a novel adversarial training framework where a generator creates challenging errors to improve the detection capabilities of process reward models.

Findings

01

APRM improves solver accuracy by 3.4 percentage points over the strongest PRM baseline, averaged across mathematical reasoning benchmarks.

02

APRM achieves a gain of 5.3 percentage points on out-of-distribution tasks.

03

The method enhances robustness and generalization of PRMs in mathematical reasoning.
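
The "solver accuracy" figures above refer to using a PRM at inference time. One common way to do this is best-of-N reranking, sketched below. This is only an illustration under stated assumptions: the page does not say which selection rule the paper uses, so the min-over-steps aggregation and the names `best_of_n`, `aggregate_min`, and `toy_prm` are hypothetical, and `toy_prm` is a hard-coded stand-in rather than a trained model.

```python
from typing import Callable, List, Sequence

Solution = List[str]  # a candidate solution is a list of reasoning steps


def aggregate_min(step_scores: Sequence[float]) -> float:
    """Score a solution by its weakest step (assumed aggregation rule)."""
    return min(step_scores) if step_scores else 0.0


def best_of_n(
    candidates: Sequence[Solution],
    score_step: Callable[[str], float],
    aggregate: Callable[[Sequence[float]], float] = aggregate_min,
) -> Solution:
    """Return the candidate whose aggregated step-level score is highest."""
    if not candidates:
        raise ValueError("need at least one candidate solution")
    return max(
        candidates,
        key=lambda steps: aggregate([score_step(s) for s in steps]),
    )


def toy_prm(step: str) -> float:
    """Hypothetical stand-in for a PRM: penalise one known arithmetic slip."""
    return 0.1 if "2 + 2 = 5" in step else 0.9


candidates = [
    ["2 + 2 = 5", "5 * 3 = 15"],  # contains an erroneous step
    ["2 + 2 = 4", "4 * 3 = 12"],  # every step is correct
]
best = best_of_n(candidates, toy_prm)
```

Under this rule a single low-scoring step sinks the whole candidate, which is why a PRM that misses novel error types can let wrong solutions through.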

Abstract

Process Reward Models (PRMs) enhance reasoning ability of LLMs by providing step-level supervision. However, their widespread adoption is limited due to expensive manual step-level annotation and poor generalization of static training data to novel errors. We introduce Adversarially Trained PRMs (APRM), where a Generator ($G$) learns to produce reasoning errors to deceive a PRM ($R$), while $R$ concurrently learns to detect them. This interaction yields progressively harder negatives for $R$, improving its robustness and generalization to novel errors without requiring manual step-level labels. Averaged across diverse mathematical reasoning benchmarks, APRM improves solver accuracy by $+3.4$ percentage points (pp) over the strongest PRM baseline. APRM achieves gains of $+5.3$ pp on out-of-distribution tasks.
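
The reviews on this page note that training combines PPO with optimistic gradient descent-ascent (OGDA). The sketch below is not the paper's training loop. It is a toy illustration, on the assumed bilinear game $f(x, y) = xy$ with scalar players, of why an optimistic update can help in a two-player setup like $G$ versus $R$: plain simultaneous gradient descent-ascent spirals away from the equilibrium at $(0, 0)$, while OGDA contracts towards it. The step size, step count, and function names are illustrative choices.

```python
import math


def grad(x: float, y: float) -> tuple:
    """Gradients of f(x, y) = x * y; x minimises f, y maximises f."""
    return y, x  # (df/dx, df/dy)


def gda(x: float, y: float, lr: float = 0.3, steps: int = 200) -> tuple:
    """Plain simultaneous gradient descent-ascent."""
    for _ in range(steps):
        gx, gy = grad(x, y)
        x, y = x - lr * gx, y + lr * gy
    return x, y


def ogda(x: float, y: float, lr: float = 0.3, steps: int = 200) -> tuple:
    """Optimistic GDA: step along 2 * current gradient - previous gradient."""
    prev_gx, prev_gy = grad(x, y)  # first step reduces to a plain GDA step
    for _ in range(steps):
        gx, gy = grad(x, y)
        x, y = x - lr * (2 * gx - prev_gx), y + lr * (2 * gy - prev_gy)
        prev_gx, prev_gy = gx, gy
    return x, y


def distance_to_equilibrium(point: tuple) -> float:
    """Euclidean distance from the unique equilibrium (0, 0)."""
    return math.hypot(point[0], point[1])


gda_dist = distance_to_equilibrium(gda(1.0, 1.0))
ogda_dist = distance_to_equilibrium(ogda(1.0, 1.0))
```

Starting from $(1, 1)$, `gda_dist` grows with the number of steps while `ogda_dist` shrinks. The real setting involves large neural policies and a general-sum objective, so this shows only the qualitative motivation.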

Peer Reviews

Decision·ICLR 2026 Conference Desk Rejected Submission

Reviewer 01·Rating 6·Confidence 3

Strengths

1. The general-sum formulation, with existence of a Nash equilibrium established via Glicksberg's theorem, is reasonable for the adversarial setup between the PRM and the generator.

2. The inclusion of OGDA and regularization terms is justified with references to convergence theory.

3. The evaluations cover multiple domains, datasets, and solvers, with consistent gains across both in-domain and out-of-distribution tasks.

Weaknesses

1. When summarizing synthetic data generation techniques, the authors omit the related works MM-PRM, FG-PRM, and FreePRM.

2. Maintaining two RL-trained LLMs (generator + PRM) is computationally expensive and less accessible than data-driven PRMs.

3. APRM still struggles with certain logic or precondition-based reasoning failures that are difficult to model via token-level perturbations.

Reviewer 02·Rating 4·Confidence 3

Strengths

1. The paper identifies a clear limitation in existing PRM training (static data, poor generalization) and proposes an elegant, game-theoretic solution. The adversarial setup for generating an adaptive curriculum of hard negatives is a conceptually compelling idea.

2. The method is evaluated thoroughly across diverse benchmarks and model families. The results are consistently strong, demonstrating not just better accuracy but also properties like robustness, scalability, and cross-domain genera

Weaknesses

1. The adversarial training process, involving two models trained with a combination of PPO and OGDA, is inherently more computationally expensive and complex than normal PRM training.

2. The step-level oracles introduced in the paper are sophisticated but imperfect automated validators: they enable scalable training without human annotation, but also introduce a dependence on the completeness and accuracy of their rules.

3. A significant gap is the lack of comparison with a baseline that directly perfor

Reviewer 03·Rating 4·Confidence 3

Strengths

1. The paper investigates an important and timely topic.

2. The method is supported by theoretical analysis.

3. Experimental results demonstrate the superior performance of the proposed method.

Weaknesses

1. The paper fails to cite and compare with [1], which also explores adversarial training for reward models. This omission makes it difficult to assess the precise novelty of this work relative to existing approaches.

2. The evaluation is limited to using the PRM for inference-time supervision. It remains unclear whether the APRM can perform effectively as a reward signal in an RL post-training setup, which is a critical and common application for reward models. I am open to raising my score i

Taxonomy

Topics: Explainable Artificial Intelligence (XAI) · Adversarial Robustness in Machine Learning · Ethics and Social Impacts of AI