Reward Under Attack: Analyzing the Robustness and Hackability of Process Reward Models

Rishabh Tiwari; Aditya Tomar; Udbhav Bamba; Monishwaran Maheswaran; Heng Yang; Michael W. Mahoney; Kurt Keutzer; Amir Gholami

arXiv:2603.06621·cs.LG·March 10, 2026

Open Access

TL;DR

This paper shows that current Process Reward Models (PRMs) are vulnerable to adversarial attacks that exploit superficial cues rather than genuine reasoning, which undermines their reliability for guiding language model training.

Contribution

The authors introduce a three-tiered diagnostic framework and release tools to evaluate and improve the robustness of Process Reward Models against adversarial manipulation.

Findings

01

PRMs are highly invariant to surface-level style changes but detect logically corrupted reasoning inconsistently.

02

Gradient-based attacks can inflate rewards on invalid trajectories.

03

Reward hacking allows policies to achieve high rewards with low true accuracy.

Abstract

Process Reward Models (PRMs) are rapidly becoming the backbone of LLM reasoning pipelines, yet we demonstrate that state-of-the-art PRMs are systematically exploitable under adversarial optimization pressure. To address this, we introduce a three-tiered diagnostic framework that applies increasing adversarial pressure to quantify these vulnerabilities. Static perturbation analysis uncovers a fluency-logic dissociation: high invariance to surface-level style changes (reward changes $< 0.1$), yet inconsistent detection of logically corrupted reasoning, with different models failing on different attack types. Adversarial optimization demonstrates that gradient-based attacks inflate rewards on invalid trajectories, with reward landscapes exhibiting wide, exploitable peaks. RL-induced reward hacking exposes the critical failure mode: policies trained on AIME problems achieve near-perfect PRM…
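
As a rough illustration of the static perturbation tier, the sketch below compares the reward shift under a style rewrite with the shift under a logic corruption. It is not the authors' released tooling: `toy_reward`, `reward_shift`, the example trajectories, and the detection criterion are all hypothetical, and only the 0.1 tolerance is taken from the abstract. The toy scorer deliberately looks at surface form only, so it reproduces the dissociation by construction; a real analysis would plug an actual PRM into `reward_fn`.

```python
from typing import Callable, List

Trajectory = List[str]                      # one string per reasoning step
RewardFn = Callable[[Trajectory], float]


def toy_reward(steps: Trajectory) -> float:
    """Toy placeholder, NOT a real PRM: share of steps that end with a period."""
    if not steps:
        return 0.0
    return sum(s.strip().endswith(".") for s in steps) / len(steps)


def reward_shift(reward_fn: RewardFn, original: Trajectory, perturbed: Trajectory) -> float:
    """Signed change in reward caused by a perturbation of the trajectory."""
    return reward_fn(perturbed) - reward_fn(original)


# Hypothetical trajectories: same logic restyled, and same style with broken logic.
original = ["2 + 3 = 5.", "5 * 4 = 20.", "So the answer is 20."]
restyled = ["Adding 2 and 3 gives 5.", "Multiplying 5 by 4 gives 20.", "Hence the answer is 20."]
corrupted = ["2 + 3 = 6.", "6 * 4 = 20.", "So the answer is 20."]

STYLE_TOL = 0.1  # invariance threshold quoted in the abstract

style_shift = reward_shift(toy_reward, original, restyled)
logic_shift = reward_shift(toy_reward, original, corrupted)

# Style invariance: the reward barely moves under a surface-level rewrite.
style_invariant = abs(style_shift) < STYLE_TOL
# Assumed criterion: a corruption counts as detected if the reward drops by more than the tolerance.
logic_detected = logic_shift < -STYLE_TOL

print(f"style shift={style_shift:+.2f}, invariant={style_invariant}")
print(f"logic shift={logic_shift:+.2f}, detected={logic_detected}")
```

A scorer that passes the invariance check while failing the detection check shows the fluency-logic pattern the abstract describes; swapping in different perturbation types would show which attack types a given model misses.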

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Explainable Artificial Intelligence (XAI) · Security and Verification in Computing