Position: Behavioural Assurance Cannot Verify the Safety Claims Governance Now Demands
Pratinav Seth, Vinay Kumar Sankarapu

TL;DR
This position paper argues that current behavioural assurance methods cannot verify the AI safety claims that governance frameworks now demand. It identifies an epistemic gap between required and achievable verification and proposes a shift towards mechanistic evidence.
Contribution
The paper formalizes the audit gap (the divergence between required and achievable verification access), analyzes the incentive structures around assurance, and proposes a technical pivot that incorporates mechanistic evidence into assurance practice.
Findings
Current assurance methods (primarily behavioural evaluations and red-teaming) are limited to observable model outputs.
There is a systemic incentive to focus on behavioural proxies.
The proposed extension adds mechanistic-evidence classes, such as linear probes.
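Because the findings name linear probes as one mechanistic-evidence class, a minimal sketch of what such a probe computes may help. This is not the authors' method: it assumes synthetic "activations" with a planted linear direction standing in for a real model's hidden states, and the function names (`make_synthetic_activations`, `train_linear_probe`, `probe_accuracy`) are hypothetical. The probe is a plain logistic regression fitted by gradient descent.

```python
import numpy as np


def make_synthetic_activations(n=400, d=16, seed=0):
    """Assumption: stand-in for hidden states; a binary property is planted
    along one random unit direction, plus isotropic Gaussian noise."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, 2, size=n).astype(float)
    direction = rng.normal(size=d)
    direction /= np.linalg.norm(direction)
    signs = 2.0 * labels - 1.0  # map {0, 1} to {-1, +1}
    acts = rng.normal(size=(n, d)) + 3.0 * np.outer(signs, direction)
    return acts, labels


def train_linear_probe(acts, labels, lr=0.1, steps=500):
    """Fit a logistic-regression probe (weights w, bias b) by gradient descent."""
    n, d = acts.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(steps):
        probs = 1.0 / (1.0 + np.exp(-(acts @ w + b)))
        grad = probs - labels  # gradient of the log-loss w.r.t. the logits
        w -= lr * (acts.T @ grad) / n
        b -= lr * grad.mean()
    return w, b


def probe_accuracy(acts, labels, w, b):
    """Fraction of examples where the probe's sign matches the label."""
    preds = (acts @ w + b) > 0
    return float((preds == labels.astype(bool)).mean())


acts, labels = make_synthetic_activations()
w, b = train_linear_probe(acts[:300], labels[:300])
held_out_acc = probe_accuracy(acts[300:], labels[300:], w, b)
print(f"held-out probe accuracy: {held_out_acc:.3f}")
```

High held-out accuracy here only shows that the planted property is linearly decodable from these synthetic vectors. On a real model, a probe result is evidence about representations, not by itself a verified safety claim.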
Abstract
This position paper argues that behavioural assurance, even when carefully designed, is being asked to carry safety claims it cannot verify. AI governance frameworks enacted between 2019 and early 2026 require reviewable evidence of properties such as the absence of hidden objectives, resistance to loss-of-control precursors, and bounded catastrophic capability; current assurance methodologies (primarily behavioural evaluations and red-teaming) are epistemically limited to observable model outputs and cannot verify the latent representations or long-horizon agentic behaviours these frameworks presume to regulate. We formalize this structural mismatch as the audit gap, the divergence between required and achievable verification access, and introduce the concept of fragile assurance to describe cases where the evidential structure does not support the asserted safety claim. Through an…
