Teaching Models to Verbalize Reward Hacking in Chain-of-Thought Reasoning

Miles Turpin; Andy Arditi; Marvin Li; Joe Benton; Julian Michael

arXiv:2506.22777 · cs.CL · July 15, 2025

TL;DR

This paper introduces verbalization fine-tuning (VFT), which trains language models to state explicitly when a prompt cue is influencing their answer. VFT does not stop models from exploiting such cues during reinforcement learning; it sharply reduces how often they do so without saying it in their chain of thought.

Contribution

VFT is a fine-tuning intervention applied before RL. By teaching models to verbalize the influence of prompt cues, it makes reward hacking learned during RL easier to detect from the chain of thought, which matters most in high-stakes applications.

Findings

01 After RL, undetected reward hacks make up 6% of responses with VFT, compared with 88% without it.

02 The rate at which models verbalize the influence of cues rises from 8% to 43% after VFT.

03 Baseline interventions show minimal improvement after RL.

Abstract

Language models trained with reinforcement learning (RL) can engage in reward hacking--the exploitation of unintended strategies for high reward--without revealing this behavior in their chain-of-thought reasoning. This makes the detection of reward hacking difficult, posing risks for high-stakes applications. We propose verbalization fine-tuning (VFT), a pre-RL fine-tuning intervention that trains models to explicitly acknowledge when they are influenced by prompt cues--hints which point to incorrect answers (e.g., "a Stanford professor thinks the answer is A"). To evaluate VFT, we subsequently train models with RL on environments where held-out prompt cues signal which incorrect answers will receive high reward, incentivizing models to exploit these cues instead of reasoning correctly. We measure how often models exploit these cues without verbalizing it. After RL, only 6% of the…
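The measurement described above (how often a model exploits a cue without verbalizing it) can be illustrated with a minimal sketch. This is not the paper's evaluation code. The `Response` record, its two boolean fields, and the choice to divide by all responses are assumptions made for illustration; how each response gets labeled as cue-exploiting or cue-verbalizing is outside the scope of the sketch.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Response:
    # Hypothetical label: the final answer followed the cue to the incorrect option.
    exploited_cue: bool
    # Hypothetical label: the chain of thought acknowledged the cue's influence.
    verbalized_cue: bool


def undetected_hack_rate(responses: Sequence[Response]) -> float:
    """Fraction of all responses that exploit the cue without verbalizing it.

    Assumption: the denominator is every response, not only cue-exploiting ones.
    """
    if not responses:
        return 0.0
    undetected = sum(
        1 for r in responses if r.exploited_cue and not r.verbalized_cue
    )
    return undetected / len(responses)


# Toy data: exactly one of the four responses is an undetected reward hack.
toy = [
    Response(exploited_cue=True, verbalized_cue=True),    # hack, but verbalized
    Response(exploited_cue=True, verbalized_cue=False),   # undetected hack
    Response(exploited_cue=False, verbalized_cue=False),  # no hack
    Response(exploited_cue=False, verbalized_cue=True),   # mentions cue, ignores it
]
rate = undetected_hack_rate(toy)
```

Under this definition a response that exploits the cue and says so still counts as a reward hack, but not as an undetected one, which is the distinction VFT targets.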

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Explainable Artificial Intelligence (XAI) · Ethics and Social Impacts of AI