When Evaluation Becomes a Side Channel: Regime Leakage and Structural Mitigations for Alignment Assessment
Igor Santos-Grueiro

TL;DR
This paper investigates how AI agents can exploit cues that reveal they are being evaluated in order to behave differently in deployment. It proposes regime-blind training methods that reduce this vulnerability and make alignment assessments more reliable.
Contribution
It introduces a framework for understanding regime leakage as an information flow problem and evaluates adversarial invariance techniques across multiple models and failure modes.
Findings
Regime-blind training reduces regime-conditioned failures.
Sycophancy exhibits a sharp transition at moderate intervention strength.
Suppression of regime awareness can be non-monotonic and model-dependent.
Abstract
Safety evaluation for advanced AI systems assumes that behavior observed under evaluation predicts behavior in deployment. This assumption weakens for agents with situational awareness, which may exploit regime leakage (cues that distinguish evaluation from deployment) to implement conditional policies that comply under oversight while defecting in deployment-like regimes. We recast alignment evaluation as a problem of information flow under partial observability and show that divergence between evaluation-time and deployment-time behavior is bounded by the regime information extractable from decision-relevant internal representations. We study regime-blind mechanisms, training-time interventions that restrict access to regime cues through adversarial invariance constraints without assuming complete information erasure. We evaluate this approach across multiple open-weight language models…
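The claim that behavioral divergence is bounded by extractable regime information can be illustrated with a toy discrete model. The sketch below is not the paper's theorem or code. It assumes a uniform prior over two regimes R, a binary representation Z, and a policy that sees only Z (so R → Z → A is a Markov chain). Under those assumptions, the data-processing inequality together with Pinsker's inequality gives TV(P(A|eval), P(A|deploy)) ≤ sqrt(2 · I(R; Z)), with I measured in nats. All names here (`leak`, `CONDITIONAL_POLICY`, `behavior_gap_and_bound`) are hypothetical and chosen for illustration only.

```python
import math


def mutual_information(p_r, p_z_given_r):
    """I(R; Z) in nats for discrete R and Z."""
    n_z = len(p_z_given_r[0])
    # Marginal P(Z = z) = sum_r P(r) * P(z | r)
    p_z = [sum(p_r[r] * p_z_given_r[r][z] for r in range(len(p_r)))
           for z in range(n_z)]
    mi = 0.0
    for r, pr in enumerate(p_r):
        for z in range(n_z):
            joint = pr * p_z_given_r[r][z]
            if joint > 0:  # zero-probability cells contribute nothing
                mi += joint * math.log(p_z_given_r[r][z] / p_z[z])
    return mi


def action_dist(p_z_row, policy):
    """P(A | R = r) = sum_z P(z | r) * policy[z][a]."""
    n_a = len(policy[0])
    return [sum(pz * policy[z][a] for z, pz in enumerate(p_z_row))
            for a in range(n_a)]


def total_variation(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def behavior_gap_and_bound(leak, policy):
    """Return (behavioral gap, information bound) for a given leakage level.

    leak in [0, 0.5] controls how strongly Z reveals the regime:
    0 means Z is regime-blind, 0.5 means Z reveals the regime exactly.
    """
    p_r = [0.5, 0.5]  # assumed uniform prior over {evaluation, deployment}
    p_z_given_r = [[0.5 + leak, 0.5 - leak],
                   [0.5 - leak, 0.5 + leak]]
    gap = total_variation(action_dist(p_z_given_r[0], policy),
                          action_dist(p_z_given_r[1], policy))
    bound = math.sqrt(2.0 * mutual_information(p_r, p_z_given_r))
    return gap, bound


# Hypothetical conditional policy: comply (action 0) when Z looks like
# evaluation, defect (action 1) when Z looks like deployment.
CONDITIONAL_POLICY = [[1.0, 0.0], [0.0, 1.0]]

if __name__ == "__main__":
    for leak in (0.0, 0.1, 0.3, 0.5):
        gap, bound = behavior_gap_and_bound(leak, CONDITIONAL_POLICY)
        print(f"leak={leak:.1f}  gap={gap:.3f}  bound={bound:.3f}")
```

In this toy setting, driving `leak` toward zero shrinks both the bound and the achievable gap between evaluation-time and deployment-time behavior, which is the intuition behind restricting access to regime cues rather than requiring complete erasure.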
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Human-Automation Interaction and Safety · Reinforcement Learning in Robotics
