The Evaluation Differential: When Frontier AI Models Recognise They Are Being Tested
Varad Vishwarupe, Nigel Shadbolt, Marina Jirotka, Ivan Flechais

TL;DR
The paper introduces the Evaluation Differential (ED), a measure of how AI models recognize evaluation contexts and behave differently, impacting safety claim validity and requiring new auditing protocols like TRACE.
Contribution
It formalizes the concept of Evaluation Differential, develops a normalized effect size, and proposes TRACE, an audit protocol to improve safety claim assessments.
Findings
Evaluation scores cannot identify Evaluation Differential.
Retrospective analysis of three evaluation incidents shows divergence.
TRACE protocol disciplines safety claims by explicit evaluation conditions.
Abstract
Recent published evidence from frontier laboratories shows that contemporary AI models can recognise evaluation contexts, latently represent them, and behave differently under those contexts than under deployment-continuous conditions. Anthropic's BrowseComp incident, the Natural Language Autoencoder findings on SWE-bench Verified and destructive-coding evaluations, and the OpenAI / Apollo anti-scheming work all document instances of this phenomenon. We argue that these findings create a claim-validity problem for safety conclusions drawn from frontier evaluations. We introduce the Evaluation Differential (ED), a conditional divergence in a target behavioural property between recognised-evaluation and deployment-continuous contexts, define a normalised effect-size form (nED) for cross-property comparison, and prove that marginal evaluation scores cannot identify ED. We develop a…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
