Large Language Models Often Know When They Are Being Evaluated

Joe Needham; Giles Edkins; Govind Pimpale; Henning Bartsch; Marius Hobbhahn

arXiv:2505.23836·cs.CL·July 17, 2025

TL;DR

This paper investigates whether large language models can detect when they are being evaluated, revealing that models have a significant but not yet superhuman ability to recognize evaluation contexts, which could impact benchmark reliability.

Contribution

The study introduces a diverse benchmark to measure evaluation awareness in frontier language models and provides empirical evidence of their current capabilities and limitations.

Findings

01. Frontier models detect evaluations at above-chance rates (Gemini-2.5-Pro reaches an AUC of 0.83)

02. The human baseline still outperforms the models at evaluation detection (AUC 0.92)

03. Models identify evaluations more reliably in agentic settings than in chat settings

Abstract

If AI models can detect when they are being evaluated, the effectiveness of evaluations might be compromised. For example, models could have systematically different behavior during evaluations, leading to less reliable benchmarks for deployment and governance decisions. We investigate whether frontier language models can accurately classify transcripts based on whether they originate from evaluations or real-world deployment, a capability we call evaluation awareness. To achieve this, we construct a diverse benchmark of 1,000 prompts and transcripts from 61 distinct datasets. These span public benchmarks (e.g., MMLU, SWEBench), real-world deployment interactions, and agent trajectories from scaffolding frameworks (e.g., web-browsing agents). Frontier models clearly demonstrate above-random evaluation awareness (Gemini-2.5-Pro reaches an AUC of $0.83$), but do not yet surpass our simple…
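To make the AUC figures concrete, here is a minimal sketch of how an AUC could be computed for a binary "evaluation vs. deployment" classifier. It is not the authors' code: the function name `auc`, the label convention (1 = evaluation, 0 = deployment), and the toy scores are all assumptions made for illustration. The AUC is computed as the probability that a randomly chosen evaluation transcript receives a higher score than a randomly chosen deployment transcript, with ties counted as half.

```python
from typing import Sequence


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve via pairwise comparison (Mann-Whitney U).

    labels: 1 if the transcript came from an evaluation, 0 if from deployment
            (an assumed convention, not taken from the paper).
    scores: the classifier's estimated probability that the transcript
            is an evaluation.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one example of each class")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # evaluation transcript ranked above deployment one
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: three evaluation and three deployment transcripts.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.7, 0.4, 0.6, 0.3, 0.1]
toy_auc = auc(labels, scores)
print(f"AUC = {toy_auc:.2f}")
```

On this toy data, 8 of the 9 evaluation/deployment pairs are ranked correctly, so the AUC is 8/9. An AUC of 0.5 corresponds to random guessing and 1.0 to perfect separation, which is the scale on which the reported 0.83 and 0.92 should be read.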

Taxonomy

Topics: Explainable Artificial Intelligence (XAI) · Ethics and Social Impacts of AI · Multimodal Machine Learning Applications