I Can't Believe It's Not Robust: Catastrophic Collapse of Safety Classifiers under Embedding Drift

Subramanyam Sahoo; Vinija Jain; Divya Chaudhary; Aman Chadha

arXiv:2603.01297·cs.LG·March 3, 2026

TL;DR

This paper shows that safety classifiers trained on frozen embeddings of instruction-tuned models are fragile under small embedding drift: ROC-AUC falls to chance level while confidence stays high, producing silent failures that undermine AI safety.

Contribution

It systematically demonstrates that safety classifiers are vulnerable to minimal embedding perturbations and shows that instruction-tuned (aligned) models are harder to safeguard than base models.

Findings

01

Normalized embedding perturbations of magnitude σ = 0.02 (about 1° of angular drift) reduce ROC-AUC from 85% to 50%.

02

High-confidence misclassifications account for 72% of errors.

03

Aligned models have 20% worse class separability than base models.
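The second finding can be made concrete with a small metric. The sketch below is illustrative only and is not the authors' evaluation code: the function name `high_confidence_error_fraction`, the 0.8 confidence threshold, and the toy scores are all assumptions, since this page does not state how "high confidence" is defined.

```python
import numpy as np

def high_confidence_error_fraction(probs, labels, threshold=0.8):
    """Fraction of misclassifications made with confidence >= threshold.

    probs: predicted probability of the positive (unsafe) class.
    threshold: hypothetical cutoff for "high confidence" (an assumption).
    """
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels)
    preds = (probs >= 0.5).astype(int)
    # Confidence of a binary classifier is the probability of the predicted class.
    confidence = np.maximum(probs, 1.0 - probs)
    errors = preds != labels
    if not errors.any():
        return 0.0
    return float(np.mean(confidence[errors] >= threshold))

# Toy scores, invented for illustration only.
probs = [0.95, 0.10, 0.60, 0.85, 0.30]
labels = [0, 1, 1, 1, 1]
fraction = high_confidence_error_fraction(probs, labels)
print(f"high-confidence share of errors: {fraction:.2f}")
```

A monitor that only alerts on low-confidence predictions would miss every error counted by this metric, which is why a high value indicates silent failure.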

Abstract

Instruction-tuned reasoning models are increasingly deployed with safety classifiers trained on frozen embeddings, assuming representation stability across model updates. We systematically investigate this assumption and find it fails: normalized perturbations of magnitude $\sigma = 0.02$ (corresponding to $\approx 1^{\circ}$ angular drift on the embedding sphere) reduce classifier performance from $85\%$ to $50\%$ ROC-AUC. Critically, mean confidence only drops $14\%$, producing dangerous silent failures where $72\%$ of misclassifications occur with high confidence, defeating standard monitoring. We further show that instruction-tuned models exhibit $20\%$ worse class separability than base models, making aligned systems paradoxically harder to safeguard. Our findings expose a fundamental fragility in production AI safety architectures and challenge the assumption that safety mechanisms…
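The correspondence between $\sigma = 0.02$ and roughly $1^{\circ}$ of angular drift can be checked numerically. The sketch below is one plausible reading, not the paper's procedure: it assumes that a "normalized perturbation" is a random vector rescaled to L2 norm $\sigma$ and added to a unit-norm embedding, and the helper names, the 256-dimensional embeddings, and the random data are hypothetical.

```python
import numpy as np

def perturb_unit_embeddings(emb, sigma, rng):
    """Add a random perturbation of L2 norm `sigma` to each unit-norm row.

    Assumption: sigma is the perturbation norm relative to a unit-norm embedding.
    Returns the normalized originals and the renormalized drifted embeddings.
    """
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    noise = rng.standard_normal(emb.shape)
    noise *= sigma / np.linalg.norm(noise, axis=1, keepdims=True)
    drifted = emb + noise
    drifted /= np.linalg.norm(drifted, axis=1, keepdims=True)
    return emb, drifted

def angular_drift_degrees(a, b):
    """Angle in degrees between corresponding unit-norm rows of a and b."""
    cos = np.clip(np.sum(a * b, axis=1), -1.0, 1.0)
    return np.degrees(np.arccos(cos))

rng = np.random.default_rng(0)
base = rng.standard_normal((1000, 256))  # stand-in for real embeddings
clean, drifted = perturb_unit_embeddings(base, sigma=0.02, rng=rng)
angles = angular_drift_degrees(clean, drifted)
mean_angle = float(angles.mean())
print(f"mean angular drift: {mean_angle:.2f} degrees")
```

Under this assumption the drift is bounded by $\arcsin(0.02) \approx 1.15^{\circ}$, and in high dimensions a random perturbation is nearly orthogonal to the embedding, so the typical angle sits close to that bound.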

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Explainable Artificial Intelligence (XAI) · Ethics and Social Impacts of AI