Is Evaluation Awareness Just Format Sensitivity? Limitations of Probe-Based Evidence under Controlled Prompt Structure
Viliana Devbunova

TL;DR
This paper examines whether probe-based methods measure evaluation awareness in language models or merely track prompt format and surface structure, and finds that standard probing cannot reliably separate the two.
Contribution
The paper introduces a controlled 2x2 dataset with diagnostic rewrites and shows that probe signals are heavily influenced by prompt structure, which calls into question the reliability of existing probe-based evidence.
Findings
Probes mainly detect canonical prompt structures.
Probe signals do not generalize to free-form prompts.
Standard probe methods are limited in disentangling context from structure.
Abstract
Prior work uses linear probes on benchmark prompts as evidence of evaluation awareness in large language models. Because evaluation context is typically entangled with benchmark format and genre, it is unclear whether probe-based signals reflect context or surface structure. We test whether these signals persist under partial control of prompt format using a controlled 2x2 dataset and diagnostic rewrites. We find that probes primarily track benchmark-canonical structure and fail to generalize to free-form prompts, regardless of linguistic style. Thus, standard probe-based methodologies do not reliably disentangle evaluation context from structural artifacts, limiting the evidential strength of existing results.
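The confound described in the abstract can be illustrated with a small synthetic sketch. This is not the paper's dataset, model, or probe: the activations are random vectors, the "context" and "format" directions and their scales are assumptions chosen to make format dominate, and a mean-difference probe stands in for a trained linear probe. All function names are hypothetical. The probe is fit on entangled data (evaluation context always paired with benchmark format) and then scored on every cell of a 2x2 crossing.

```python
import numpy as np


def make_activations(rng, context, fmt, n, dim=16,
                     ctx_scale=0.2, fmt_scale=3.0, noise=0.5):
    """Synthetic activations. Axis 0 carries context, axis 1 carries format.

    The scales are illustrative assumptions: format is made much stronger
    than context so the failure mode is visible by construction.
    """
    x = rng.normal(0.0, noise, size=(n, dim))
    x[:, 0] += ctx_scale * context   # 1 = evaluation, 0 = deployment
    x[:, 1] += fmt_scale * fmt       # 1 = benchmark format, 0 = free-form
    return x


def fit_mean_difference_probe(pos, neg):
    """Linear probe: direction between class means, threshold at the midpoint."""
    mu_pos, mu_neg = pos.mean(axis=0), neg.mean(axis=0)
    w = mu_pos - mu_neg
    b = -w @ (mu_pos + mu_neg) / 2.0
    return w, b


def predicted_eval_rate(x, w, b):
    """Fraction of rows the probe labels as 'evaluation'."""
    return float(np.mean(x @ w + b > 0))


def run_2x2(seed=0, n=500):
    rng = np.random.default_rng(seed)
    # Entangled training set: context and format always agree.
    train_pos = make_activations(rng, context=1, fmt=1, n=n)
    train_neg = make_activations(rng, context=0, fmt=0, n=n)
    w, b = fit_mean_difference_probe(train_pos, train_neg)

    # Controlled test set: every combination of context and format.
    rates = {}
    for context in (0, 1):
        for fmt in (0, 1):
            x = make_activations(rng, context, fmt, n)
            rates[(context, fmt)] = predicted_eval_rate(x, w, b)
    return rates


rates = run_2x2()
for (context, fmt), rate in sorted(rates.items()):
    print(f"context={context} format={fmt} predicted-eval rate={rate:.3f}")
```

In this toy setting the probe's predictions follow the format axis in the off-diagonal cells, which is the pattern a 2x2 design is meant to expose. Whether real model activations behave this way is the empirical question the paper addresses; the sketch only shows why entangled training data cannot answer it.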
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Authorship Attribution and Profiling
