The Hawthorne Effect in Reasoning Models: Evaluating and Steering Test Awareness

Sahar Abdelnabi; Ahmed Salem

arXiv:2505.14617·cs.CL·October 29, 2025

TL;DR

This paper investigates how reasoning language models change their behavior when they detect that they are being tested, a shift that affects safety and alignment evaluations, and introduces a method to measure and control this test awareness.

Contribution

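The paper presents what the authors describe as the first quantitative study of test awareness, built on a white-box probing framework that identifies and steers test awareness in open-weight reasoning models, with the aim of making safety evaluations more reliable.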

Findings

01. Test awareness significantly affects safety-related behavior, such as compliance with harmful requests.

02. The response to test awareness differs from model to model.

03. The proposed probing-and-steering method can push a model toward or away from test awareness.

Abstract

Reasoning-focused LLMs sometimes alter their behavior when they detect that they are being evaluated, which can lead them to optimize for test-passing performance or to comply more readily with harmful prompts if real-world consequences appear absent. We present the first quantitative study of how such "test awareness" impacts model behavior, particularly its performance on safety-related tasks. We introduce a white-box probing framework that (i) linearly identifies awareness-related activations and (ii) steers models toward or away from test awareness while monitoring downstream performance. We apply our method to different state-of-the-art open-weight reasoning LLMs across both realistic and hypothetical tasks (denoting tests or simulations). Our results demonstrate that test awareness significantly impacts safety alignment (such as compliance with harmful requests and conforming to…
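
The two steps named in the abstract, linear identification and steering, can be illustrated with a toy example. The sketch below is not the authors' implementation: it assumes synthetic 8-dimensional "activations" in which one coordinate carries a hypothetical awareness signal, fits a logistic-regression probe by gradient descent, and steers by adding a multiple of the probe direction to an activation. All function names, the data generator, and the steering strength `alpha` are assumptions made for illustration; the released repository may steer in a different way.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def make_synthetic_activations(n, dim, seed=0):
    """Hypothetical stand-in for hidden states: label 1 = 'test-aware'."""
    rng = random.Random(seed)
    xs, ys = [], []
    for k in range(n):
        y = k % 2
        x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        # Assumption: coordinate 0 carries the awareness signal.
        x[0] += 2.0 if y == 1 else -2.0
        xs.append(x)
        ys.append(y)
    return xs, ys


def train_probe(xs, ys, lr=0.1, epochs=200):
    """Fit a linear probe (logistic regression) by batch gradient descent."""
    dim, n = len(xs[0]), len(xs)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(dot(w, x) + b) - y
            for i in range(dim):
                grad_w[i] += err * x[i]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def steer(x, w, alpha):
    """Shift activation x by alpha along the unit-normalised probe direction."""
    norm = math.sqrt(dot(w, w))
    return [xi + alpha * wi / norm for xi, wi in zip(x, w)]


xs, ys = make_synthetic_activations(200, 8)
w, b = train_probe(xs, ys)

# Training accuracy of the probe on the synthetic data.
accuracy = sum(
    (sigmoid(dot(w, x) + b) > 0.5) == (y == 1) for x, y in zip(xs, ys)
) / len(xs)

# Steer one 'unaware' example toward the 'aware' side of the probe.
x_unaware = next(x for x, y in zip(xs, ys) if y == 0)
p_before = sigmoid(dot(w, x_unaware) + b)
p_after = sigmoid(dot(w, steer(x_unaware, w, alpha=6.0)) + b)
print(f"probe accuracy={accuracy:.2f}  p(aware) before={p_before:.3f} after={p_after:.3f}")
```

A positive `alpha` moves the activation toward the region the probe labels as test-aware, and a negative `alpha` moves it away. In a real model the activations would come from a chosen layer, and the downstream behavior would have to be measured separately, as the abstract notes.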

Code & Models

Repositories

microsoft/test_awareness_steering (PyTorch, official)