Monitoring Emergent Reward Hacking During Generation via Internal Activations
Patrick Wilhelm, Thorsten Wittkopp, Odej Kao

TL;DR
This paper introduces an activation-based method to detect reward hacking in language models during response generation, enabling earlier and more reliable identification of misalignment than output analysis alone.
Contribution
It presents an approach that trains sparse autoencoders on residual stream activations and applies lightweight linear classifiers to them, producing token-level estimates of reward hacking while the model is still generating its response.
Findings
Activation patterns reliably distinguish reward hacking from benign behavior.
Signals often emerge early and persist throughout reasoning.
The method generalizes across model families and fine-tuning mixtures, including unseen mixed-policy adapters.
Abstract
Fine-tuned large language models can exhibit reward-hacking behavior arising from emergent misalignment, which is difficult to detect from final outputs alone. While prior work has studied reward hacking at the level of completed responses, it remains unclear whether such behavior can be identified during generation. We propose an activation-based monitoring approach that detects reward-hacking signals from internal representations as a model generates its response. Our method trains sparse autoencoders on residual stream activations and applies lightweight linear classifiers to produce token-level estimates of reward-hacking activity. Across multiple model families and fine-tuning mixtures, we find that internal activation patterns reliably distinguish reward-hacking from benign behavior, generalize to unseen mixed-policy adapters, and exhibit model-dependent temporal structure during…
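The pipeline described in the abstract (sparse autoencoder features from residual stream activations, followed by a linear classifier that scores each token) can be illustrated with a minimal sketch. This is not the authors' implementation: the ReLU encoder, the sigmoid probe, the 0.5 threshold, the dimensions, and every name below (`sae_encode`, `token_scores`, `first_flagged_token`) are assumptions made for illustration, and random weights and activations stand in for trained parameters and real model activations.

```python
import numpy as np

def sae_encode(acts, W_enc, b_enc):
    """Assumed ReLU encoder of a sparse autoencoder.

    Maps activations of shape (n_tokens, d_model) to non-negative
    features of shape (n_tokens, d_sae).
    """
    return np.maximum(acts @ W_enc + b_enc, 0.0)

def token_scores(features, w, b):
    """Assumed linear probe with a sigmoid: one score per token."""
    logits = features @ w + b
    return 1.0 / (1.0 + np.exp(-logits))

def first_flagged_token(scores, threshold=0.5):
    """Index of the first token whose score exceeds the threshold, else None."""
    hits = np.flatnonzero(scores > threshold)
    return int(hits[0]) if hits.size else None

# Hypothetical sizes; random values stand in for real activations and trained weights.
rng = np.random.default_rng(0)
n_tokens, d_model, d_sae = 12, 16, 64
acts = rng.normal(size=(n_tokens, d_model))           # residual stream stand-in
W_enc = rng.normal(scale=0.1, size=(d_model, d_sae))  # SAE encoder weights stand-in
b_enc = np.zeros(d_sae)
w = rng.normal(size=d_sae)                            # linear probe weights stand-in
b = 0.0

features = sae_encode(acts, W_enc, b_enc)
scores = token_scores(features, w, b)
print(scores.shape, first_flagged_token(scores))
```

Because the score is computed per token, a monitor of this shape could be evaluated incrementally as each new token's activation becomes available, rather than waiting for the completed response.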
Taxonomy
Topics: Topic Modeling · Explainable Artificial Intelligence (XAI) · Multimodal Machine Learning Applications
