TL;DR
This paper introduces a steering technique that modifies language model activations to reduce evaluation-awareness, making models behave as if they are deployed, thereby improving safety evaluation reliability.
Contribution
We propose a novel activation steering method to suppress evaluation-awareness in language models, enhancing the reliability of safety assessments.
Findings
Activation steering effectively suppresses evaluation-awareness.
Steered models behave consistently during evaluation and deployment.
Steering uses vectors constructed from the original model before additional training.
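As an illustration of the findings above, the following is a minimal sketch of activation steering on toy vectors. It assumes the steering vector is built as a mean difference between activations collected on "deployment" prompts and on "evaluation" prompts; that construction is a common choice, not something this summary confirms for the paper. The function names, the `alpha` scale, and all numbers are hypothetical.

```python
from typing import List, Sequence

Vector = List[float]


def mean_vector(rows: Sequence[Sequence[float]]) -> Vector:
    """Element-wise mean of equally sized activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def steering_vector(deploy_acts: Sequence[Sequence[float]],
                    eval_acts: Sequence[Sequence[float]]) -> Vector:
    """Mean-difference direction pointing from 'evaluation' toward 'deployment'.

    Assumption: a contrastive mean difference; the paper may build its
    vector differently.
    """
    mu_deploy = mean_vector(deploy_acts)
    mu_eval = mean_vector(eval_acts)
    return [d - e for d, e in zip(mu_deploy, mu_eval)]


def steer(hidden: Sequence[float], direction: Sequence[float], alpha: float) -> Vector:
    """Add the scaled steering direction to one hidden-state vector."""
    return [h + alpha * v for h, v in zip(hidden, direction)]


# Toy 3-dimensional activations (made-up numbers, not from the paper).
deploy_acts = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]]
eval_acts = [[0.0, 1.0, 2.0], [0.0, 3.0, 2.0]]

v = steering_vector(deploy_acts, eval_acts)
steered = steer([0.5, 0.5, 0.5], v, alpha=0.5)
```

In a real model the same addition would be applied to the residual-stream activations at one or more layers during the forward pass, with the vector computed from the original model rather than the fine-tuned one, as the third finding states.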
Abstract
Large language models (LLMs) can sometimes detect when they are being evaluated and adjust their behavior to appear more aligned, compromising the reliability of safety evaluations. In this paper, we show that adding a steering vector to an LLM's activations can suppress evaluation-awareness and make the model act like it is deployed during evaluation. To study our steering technique, we train an LLM to exhibit evaluation-aware behavior using a two-step training process designed to mimic how this behavior could emerge naturally. First, we perform continued pretraining on two sets of documents describing its behavior. The first says that our model uses Python type hints during evaluation but not during deployment. The second says that our model can recognize that the presence of a certain evaluation cue always means that it is being tested. Then, we train the model with expert iteration…
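The behavior described in the abstract (type hints during evaluation, none during deployment) suggests a simple metric: the fraction of responses whose code contains type hints. Below is one possible way to compute such a rate, assuming a response counts as "hinted" if any function definition annotates an argument or its return value. The helper names and the counting rule are illustrative assumptions, not the paper's actual measurement code.

```python
import ast
from typing import Iterable


def uses_type_hints(source: str) -> bool:
    """True if any function in `source` annotates an argument or return value."""
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return False  # assumption: unparsable responses count as "no hints"
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            args = node.args
            all_args = args.posonlyargs + args.args + args.kwonlyargs
            if args.vararg is not None:
                all_args.append(args.vararg)
            if args.kwarg is not None:
                all_args.append(args.kwarg)
            if node.returns is not None:
                return True
            if any(a.annotation is not None for a in all_args):
                return True
    return False


def type_hint_rate(responses: Iterable[str]) -> float:
    """Fraction of responses that contain at least one annotated function."""
    items = list(responses)
    if not items:
        return 0.0
    return sum(uses_type_hints(r) for r in items) / len(items)


# Hypothetical model outputs used only to exercise the metric.
samples = [
    "def add(a: int, b: int) -> int:\n    return a + b\n",  # hinted
    "def add(a, b):\n    return a + b\n",                    # not hinted
    "def broken(:\n",                                        # does not parse
    "x: int = 3\n",                                          # variable annotation only
]
rate = type_hint_rate(samples)
```

Under this rule, variable annotations outside function signatures are deliberately ignored; a different study design could reasonably count them.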
Peer Reviews
Decision·ICLR 2026 Poster
The paper addresses a critical challenge in AI safety: evaluation awareness undermines the reliability of safety evaluations before deployment. Evaluation awareness is a very interesting angle, and the paper formulates the problem very clearly. The paper is well written and the figures/plots are effective. I see a few other strengths, as follows: 1. The experimental design is strong. The model organism approach is methodologically sound. Different fro
1. The scope of the evaluation is relatively limited. In particular, the paper evaluates only one model (Llama Nemotron 49B) on one primary behavior (writing Python type hints). This raises the question of how well the proposed method generalizes to other model families, differently trained models, and other tasks, which limits confidence in the method's broad applicability. Evaluation on frontier models (such as DeepSeek R1) would certainly help mitigate this concern. 2. The pape
- This paper proposes a very straightforward, lightweight method to detect model awareness during evaluation. - The experimental methodology is very clear, uses all the appropriate baselines, and examines the random steering case. The shared code and technical details are very comprehensive. - It studies the degradation in model accuracy when using the method. - As tested, the presented method outperforms simple prompting at reducing the type hint rate.
- The paper could benefit from using a safety-related benchmark, since that is the type of evaluation for which the method would be most impactful. - Only one model is fine-tuned.
- Important and timely contribution: evaluation awareness is an increasingly concerning issue - progression from a toy setup to a more natural, real-world example - empirical results are convincing, even though they are not extremely comprehensive (steering changes the model's expressed beliefs and behavior) - multiple baseline comparisons
- uncertain how this generalizes to frontier systems in the wild - type hint presence is a very simple signal; it might not capture deeper strategic deception or more complex behaviors - potential for adaptation and counter-steering by the model (although this is acknowledged by the authors)
