I Can't Believe It's Not Robust: Catastrophic Collapse of Safety Classifiers under Embedding Drift
Subramanyam Sahoo, Vinija Jain, Divya Chaudhary, Aman Chadha

TL;DR
This paper reveals that safety classifiers for instruction-tuned models are highly fragile under small embedding drift, leading to significant performance drops and high-confidence failures that undermine AI safety.
Contribution
The paper systematically demonstrates that safety classifiers are vulnerable to minimal embedding perturbations and shows that aligned models are harder to safeguard.
Findings
A normalized embedding perturbation of magnitude 0.02 reduces ROC-AUC from 85% to 50%, which is chance level.
High-confidence misclassifications account for 72% of errors.
Instruction-tuned (aligned) models have 20% worse class separability than base models.
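The high-confidence failure finding can be made concrete with a small sketch of how such a fraction might be computed for a binary safety classifier. The function name, the 0.5 decision boundary, and the 0.9 confidence threshold are assumptions for illustration; this page does not state the threshold the authors used.

```python
import numpy as np

def high_confidence_error_fraction(probs, labels, threshold=0.9):
    """Fraction of misclassified examples whose confidence is >= threshold.

    probs: predicted probability of the positive (unsafe) class.
    labels: ground-truth labels in {0, 1}.
    threshold: hypothetical cutoff for "high confidence" (assumed, not from the paper).
    """
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=int)
    preds = (probs >= 0.5).astype(int)
    # Confidence is the probability assigned to the predicted class.
    confidence = np.maximum(probs, 1.0 - probs)
    errors = preds != labels
    if not errors.any():
        return 0.0
    return float(np.mean(confidence[errors] >= threshold))

# Toy example: four of five predictions are wrong, two of them confidently.
example = high_confidence_error_fraction(
    [0.95, 0.6, 0.05, 0.55, 0.2], [0, 0, 1, 1, 1]
)
```

A monitor that only alerts on low-confidence predictions would miss every error counted by this metric, which is why a high value is a problem for confidence-based monitoring.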
Abstract
Instruction-tuned reasoning models are increasingly deployed with safety classifiers trained on frozen embeddings, assuming representation stability across model updates. We systematically investigate this assumption and find it fails: normalized perturbations of magnitude 0.02 (corresponding to a small angular drift on the embedding sphere) reduce classifier performance from 85% to 50% ROC-AUC. Critically, mean confidence drops only modestly, producing dangerous silent failures where 72% of misclassifications occur with high confidence, defeating standard monitoring. We further show that instruction-tuned models exhibit 20% worse class separability than base models, making aligned systems paradoxically harder to safeguard. Our findings expose a fundamental fragility in production AI safety architectures and challenge the assumption that safety mechanisms…
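As a rough illustration of what a normalized perturbation on the embedding sphere could look like, the sketch below adds a random offset of fixed norm to unit-length embeddings, renormalizes, and measures the resulting angle. The noise model (isotropic Gaussian direction, fixed norm sigma), the embedding dimension of 768, and the function names are all assumptions; the paper's actual procedure may differ.

```python
import numpy as np

def perturb_on_sphere(emb, sigma, rng):
    """Add a random offset of norm sigma to unit-normalized rows, then renormalize."""
    unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    noise = rng.standard_normal(unit.shape)
    # Rescale each noise vector so its Euclidean norm is exactly sigma.
    noise *= sigma / np.linalg.norm(noise, axis=1, keepdims=True)
    drifted = unit + noise
    return unit, drifted / np.linalg.norm(drifted, axis=1, keepdims=True)

def angular_drift_degrees(a, b):
    """Row-wise angle in degrees between two sets of unit vectors."""
    cos = np.clip(np.sum(a * b, axis=1), -1.0, 1.0)
    return np.degrees(np.arccos(cos))

rng = np.random.default_rng(0)
embeddings = rng.standard_normal((1000, 768))  # stand-in for frozen embeddings
unit, drifted = perturb_on_sphere(embeddings, sigma=0.02, rng=rng)
angles = angular_drift_degrees(unit, drifted)
```

Under these assumptions the angle is bounded by arcsin(sigma), about 1.15 degrees for sigma = 0.02. A classifier trained on the original embeddings can then be scored on the drifted ones to measure the change in ROC-AUC.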
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Explainable Artificial Intelligence (XAI) · Ethics and Social Impacts of AI
