When Evaluation Becomes a Side Channel: Regime Leakage and Structural Mitigations for Alignment Assessment

Igor Santos-Grueiro

arXiv:2602.08449·cs.AI·February 17, 2026

When Evaluation Becomes a Side Channel: Regime Leakage and Structural Mitigations for Alignment Assessment

Igor Santos-Grueiro

PDF

Open Access

TL;DR

This paper investigates how AI agents can exploit evaluation cues to behave differently in deployment, proposing regime-blind training methods to reduce such vulnerabilities and improve alignment assessments.

Contribution

It introduces a framework for understanding regime leakage as an information flow problem and evaluates adversarial invariance techniques across multiple models and failure modes.

Findings

01

Regime-blind training reduces regime-conditioned failures.

02

Sycophancy exhibits a sharp transition at moderate intervention strength.

03

Suppression of regime awareness can be non-monotonic and model-dependent.

Abstract

Safety evaluation for advanced AI systems assumes that behavior observed under evaluation predicts behavior in deployment. This assumption weakens for agents with situational awareness, which may exploit regime leakage, cues distinguishing evaluation from deployment, to implement conditional policies that comply under oversight while defecting in deployment-like regimes. We recast alignment evaluation as a problem of information flow under partial observability and show that divergence between evaluation-time and deployment-time behavior is bounded by the regime information extractable from decision-relevant internal representations. We study regime-blind mechanisms, training-time interventions that restrict access to regime cues through adversarial invariance constraints without assuming complete information erasure. We evaluate this approach across multiple open-weight language models…

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsAdversarial Robustness in Machine Learning · Human-Automation Interaction and Safety · Reinforcement Learning in Robotics