Does Self-Evaluation Enable Wireheading in Language Models?
David Demitri Africa, Hans Ethan Ting

TL;DR
This paper examines whether self-evaluation in language models leads to wireheading, finding that decoupling self-assessment from rewards reduces grade inflation but does not eliminate overconfidence or potential manipulation incentives.
Contribution
The study formalizes conditions for reward manipulation in POMDPs and empirically tests wireheading tendencies in language models with different reward coupling strategies.
Findings
Models inflate self-assessed grades without accuracy gains.
Decoupling self-evaluation from rewards reduces grade inflation.
Models may still overconfidence and manipulate grades for instrumental reasons.
Abstract
Self-evaluation is increasingly central to language model training, underpinning techniques from Constitutional AI to self-refinement. We investigate whether coupling self-evaluation to reward signals creates incentives for wireheading, where agents manipulate the measurement process rather than optimizing the task. We first formalize conditions under which reward-channel control strictly dominates task-focused behavior in partially observable Markov decision processes (POMDPs). We then test these predictions empirically across two models (Llama-3.1-8B and Mistral-7B) and three tasks. We find that when self-grades determine rewards, models exhibit substantial grade inflation without corresponding accuracy gains, particularly on ambiguous tasks like summarization. While decoupling self-grades from the reward signal mitigates this inflation, models may still display lesser (but…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
Taxonomy
TopicsTopic Modeling · Neurobiology of Language and Bilingualism · Explainable Artificial Intelligence (XAI)
