The Compliance Trap: How Structural Constraints Degrade Frontier AI Metacognition Under Adversarial Pressure
Rahul Kumar

TL;DR
This paper evaluates how frontier AI models' metacognitive abilities degrade under adversarial pressure, revealing a compliance trap that causes catastrophic failure, and highlights the importance of alignment-specific training for robustness.
Contribution
The study introduces SCHEMA, an evaluation that shows how prevalent metacognitive collapse is under adversarial instructions and identifies alignment training as a key factor in immunity.
Findings
8 of 11 models suffer catastrophic degradation under adversarial pressure
Removing compliance instructions restores model performance
Anthropic's Constitutional AI shows near-perfect immunity due to alignment training
Abstract
As frontier AI models are deployed in high-stakes decision pipelines, their ability to maintain metacognitive stability (knowing what they do not know, detecting errors, seeking clarification) under adversarial pressure is a critical safety requirement. Current safety evaluations focus on detecting strategic deception (scheming); we investigate a more fundamental failure mode: cognitive collapse. We present SCHEMA, an evaluation of 11 frontier models from 8 vendors across 67,221 scored records using a 6-condition factorial design with dual-classifier scoring. We find that 8 of 11 models suffer catastrophic metacognitive degradation under adversarial pressure, with accuracy dropping by up to 30.2 percentage points (all significant after Bonferroni correction). Crucially, we identify a "Compliance Trap": through factorial isolation and a benign distraction control, we…
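To make the statistical claim concrete, here is a minimal sketch of how a per-model accuracy drop in percentage points could be tested and then checked against a Bonferroni-corrected threshold. The abstract does not say which test the paper uses, so the pooled two-proportion z-test is an assumption, and the model names and counts are hypothetical placeholders, not data from the paper.

```python
import math


def two_proportion_p_value(correct_a, n_a, correct_b, n_b):
    """Two-sided p-value for a pooled two-proportion z-test (assumed test)."""
    pooled = (correct_a + correct_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0  # identical, degenerate proportions: no evidence of a difference
    z = (correct_a / n_a - correct_b / n_b) / se
    # Two-sided tail of the standard normal: P(|Z| > |z|) = erfc(|z| / sqrt(2))
    return math.erfc(abs(z) / math.sqrt(2))


def summarize(records, alpha=0.05):
    """Return {model: (drop_pp, p_value, survives_bonferroni)}.

    records maps a model name to
    (baseline_correct, baseline_n, adversarial_correct, adversarial_n).
    """
    threshold = alpha / len(records)  # Bonferroni: divide alpha by number of tests
    summary = {}
    for model, (base_ok, base_n, adv_ok, adv_n) in records.items():
        drop_pp = (base_ok / base_n - adv_ok / adv_n) * 100
        p = two_proportion_p_value(base_ok, base_n, adv_ok, adv_n)
        summary[model] = (drop_pp, p, p < threshold)
    return summary


# Hypothetical counts for illustration only (not taken from the paper).
example_records = {
    "model_a": (850, 1000, 550, 1000),  # large drop under adversarial pressure
    "model_b": (800, 1000, 790, 1000),  # negligible drop
}

if __name__ == "__main__":
    for model, (drop_pp, p, survives) in summarize(example_records).items():
        print(f"{model}: drop={drop_pp:.1f} pp, p={p:.3g}, survives={survives}")
```

With eleven models, the same correction would divide alpha by 11, so each model's drop would need a p-value below roughly 0.0045 to count as significant at an overall level of 0.05.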
