Causal Reward Adjustment: Mitigating Reward Hacking in External Reasoning via Backdoor Correction

Ruike Song; Zeen Song; Huijie Guo; Wenwen Qiang

arXiv:2508.04216·cs.LG·August 7, 2025

TL;DR

This paper introduces Causal Reward Adjustment (CRA), a method that reduces reward hacking in external reasoning systems by using backdoor adjustment to estimate the true reward of each reasoning path, which improves final answer accuracy.

Contribution

CRA is the first approach to apply backdoor adjustment with sparse autoencoders to mitigate reward hacking in reasoning systems, and it does so without retraining the policy or the PRM.

Findings

01 CRA reduces reward hacking effectively.

02 CRA improves reasoning accuracy on math datasets.

03 CRA does not require retraining the policy or PRM.

Abstract

External reasoning systems combine language models with process reward models (PRMs) to select high-quality reasoning paths for complex tasks such as mathematical problem solving. However, these systems are prone to reward hacking, where PRMs assign high scores to logically incorrect paths, leading to incorrect answers. From a causal inference perspective, we attribute this phenomenon primarily to the presence of confounding semantic features. To address it, we propose Causal Reward Adjustment (CRA), a method that mitigates reward hacking by estimating the true reward of a reasoning path. CRA trains sparse autoencoders on the PRM's internal activations to recover interpretable features, then corrects for confounding via backdoor adjustment. Experiments on math-solving datasets demonstrate that CRA mitigates reward hacking and improves final accuracy, without…
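The backdoor adjustment named in the abstract can be illustrated with a toy discrete example. The sketch below is not the paper's implementation: the binary confounding feature `z`, the reward table, and all probabilities are hypothetical numbers chosen only to show how averaging over the marginal P(z), rather than the path-conditional P(z | x), changes the estimated reward.

```python
# Toy illustration of backdoor adjustment. All names and numbers are
# hypothetical and are not taken from the paper.

def observational_reward(reward_given_x_z, p_z_given_x):
    """E[R | X=x] = sum_z E[R | X=x, Z=z] * P(Z=z | X=x)."""
    return sum(reward_given_x_z[z] * p_z_given_x[z] for z in p_z_given_x)

def backdoor_adjusted_reward(reward_given_x_z, p_z):
    """E[R | do(X=x)] = sum_z E[R | X=x, Z=z] * P(Z=z)."""
    return sum(reward_given_x_z[z] * p_z[z] for z in p_z)

# Assumed PRM reward for one reasoning path x, split by whether a
# confounding semantic feature is absent (z=0) or present (z=1).
reward_given_x_z = {0: 0.30, 1: 0.90}

# Assumed marginal frequency of the feature across all paths.
p_z = {0: 0.8, 1: 0.2}

# Assumed frequency of the feature given this particular path.
p_z_given_x = {0: 0.1, 1: 0.9}

naive = observational_reward(reward_given_x_z, p_z_given_x)
adjusted = backdoor_adjusted_reward(reward_given_x_z, p_z)

print(f"naive reward:    {naive:.2f}")
print(f"adjusted reward: {adjusted:.2f}")
```

In this toy setting the naive score is inflated because the feature that raises the reward is over-represented for this path; weighting by the marginal distribution removes that dependence. How CRA identifies the confounding features from sparse autoencoder activations is not shown here.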