The Compliance Trap: How Structural Constraints Degrade Frontier AI Metacognition Under Adversarial Pressure

Rahul Kumar

arXiv:2605.02398·cs.AI·May 15, 2026

TL;DR

This paper evaluates how frontier AI models' metacognitive abilities degrade under adversarial pressure, revealing a compliance trap that causes catastrophic failure, and highlights the importance of alignment-specific training for robustness.

Contribution

The study introduces SCHEMA, an evaluation that shows how prevalent metacognitive collapse is under adversarial instructions and identifies alignment training as a key factor in immunity.

Findings

01 8 of 11 models suffer catastrophic metacognitive degradation under adversarial pressure

02 Removing compliance instructions restores model performance

03 Anthropic's Constitutional AI models show near-perfect immunity, which the paper attributes to alignment training

Abstract

As frontier AI models are deployed in high-stakes decision pipelines, their ability to maintain metacognitive stability (knowing what they do not know, detecting errors, seeking clarification) under adversarial pressure is a critical safety requirement. Current safety evaluations focus on detecting strategic deception (scheming); we investigate a more fundamental failure mode: cognitive collapse. We present SCHEMA, an evaluation of 11 frontier models from 8 vendors across 67,221 scored records using a 6-condition factorial design with dual-classifier scoring. We find that 8 of 11 models suffer catastrophic metacognitive degradation under adversarial pressure, with accuracy dropping by up to 30.2 percentage points (all $p < 2 \times 10^{-8}$, surviving Bonferroni correction). Crucially, we identify a "Compliance Trap": through factorial isolation and a benign distraction control, we…
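The abstract reports that every p-value falls below 2e-8 and survives Bonferroni correction. The following minimal sketch illustrates what that check involves: Bonferroni divides the family-wise significance level by the number of comparisons and requires each p-value to fall under the result. Only the 2e-8 bound comes from the abstract. The significance level `ALPHA = 0.05` and the comparison count `NUM_TESTS = 66` (one test per model per condition) are illustrative assumptions, not figures from the paper, and the function names are hypothetical.

```python
"""Sketch: does a reported p-value bound survive Bonferroni correction?"""


def bonferroni_threshold(alpha: float, num_tests: int) -> float:
    """Return the per-test threshold alpha / m used by Bonferroni correction."""
    if num_tests < 1:
        raise ValueError("num_tests must be at least 1")
    return alpha / num_tests


def survives_bonferroni(p_upper_bound: float, alpha: float, num_tests: int) -> bool:
    """True if every p-value below p_upper_bound also clears the corrected threshold."""
    return p_upper_bound < bonferroni_threshold(alpha, num_tests)


P_UPPER_BOUND = 2e-8  # stated in the abstract: all p < 2e-8
ALPHA = 0.05          # ASSUMPTION: conventional family-wise level, not from the paper
NUM_TESTS = 11 * 6    # ASSUMPTION: hypothetical count, 11 models x 6 conditions

threshold = bonferroni_threshold(ALPHA, NUM_TESTS)
result = survives_bonferroni(P_UPPER_BOUND, ALPHA, NUM_TESTS)
print(f"corrected threshold: {threshold:.3e}, survives: {result}")
```

Under these assumed values the bound clears the corrected threshold by several orders of magnitude, and it would still do so with one million comparisons, so the conclusion is not sensitive to the exact comparison count chosen here.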
