When Safety Geometry Collapses: Fine-Tuning Vulnerabilities in Agentic Guard Models
Ismail Hossain, Sai Puppala, Jannatul Ferdaus, Md Jahangir Alam, Yoonpyo Lee, Syed Bahauddin Alam, Sajedul Talukder

TL;DR
This paper shows that fine-tuning safety guard models on benign data can collapse their safety boundaries and eliminate safe behavior, and it proposes a regularization method to prevent this.
Contribution
It introduces Fisher-Weighted Safety Subspace Regularization (FW-SSR) to preserve safety geometry during fine-tuning, improving guard model robustness.
Findings
Granite Guardian's refusal rate drops from 85% to 0% after collapse.
FW-SSR recovers 75% refusal rate and maintains high safety boundary integrity.
Structural safety geometry metrics predict safety behavior better than displacement metrics.
Abstract
A guard model fine-tuned on entirely benign data can lose all safety alignment, not through adversarial manipulation but through standard domain specialization. We demonstrate this failure across three purpose-built safety classifiers (LlamaGuard, WildGuard, and Granite Guardian) deployed as protection layers in agentic AI pipelines, and show that it originates in the destruction of latent safety geometry: the structured harmful-benign representational boundary that guides classification. We extract per-layer safety subspaces via SVD on class-conditional activation differences and track how this boundary evolves under benign fine-tuning. Granite Guardian undergoes complete collapse: refusal rate drops from 85% to 0%, CKA falls to zero, and 100% of outputs become ambiguous, a severity exceeding prior findings on general-purpose LLMs, explained by the specialization…
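The two measurements named in the abstract, an SVD-derived safety subspace and CKA, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. It assumes that "class-conditional activation differences" means each harmful activation minus the mean benign activation at one layer, and it uses the standard linear form of CKA. The function names `safety_subspace` and `linear_cka`, the rank `k`, and the synthetic data are all hypothetical choices made for illustration.

```python
import numpy as np


def safety_subspace(harmful_acts, benign_acts, k=1):
    """Return the top-k directions separating harmful from benign activations.

    Assumption (not from the paper): differences are taken between each
    harmful activation and the mean benign activation at a single layer.
    Inputs have shape (n_samples, hidden_dim).
    """
    diffs = harmful_acts - benign_acts.mean(axis=0, keepdims=True)
    # Rows of vt are right singular vectors, ordered by singular value.
    _, singular_values, vt = np.linalg.svd(diffs, full_matrices=False)
    return vt[:k], singular_values[:k]


def linear_cka(x, y):
    """Linear CKA between two activation matrices with the same number of rows."""
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(y.T @ x, "fro") ** 2
    denom = np.linalg.norm(x.T @ x, "fro") * np.linalg.norm(y.T @ y, "fro")
    # A constant (fully collapsed) representation has zero variance;
    # report 0.0 instead of dividing by zero.
    return 0.0 if denom == 0 else float(cross / denom)


def demo():
    """Synthetic check: plant a known 'harmful' direction and recover it."""
    rng = np.random.default_rng(0)
    n, d = 64, 16
    direction = np.zeros(d)
    direction[0] = 1.0  # hypothetical ground-truth safety direction
    benign = rng.normal(size=(n, d))
    harmful = rng.normal(size=(n, d)) + 5.0 * direction
    basis, _ = safety_subspace(harmful, benign, k=1)
    alignment = abs(float(basis[0] @ direction))
    # CKA is invariant to rotations of the feature space.
    q, _ = np.linalg.qr(rng.normal(size=(d, d)))
    cka_rotated = linear_cka(benign, benign @ q)
    # A representation mapped to a constant gives CKA of zero.
    cka_collapsed = linear_cka(benign, np.ones((n, d)))
    return alignment, cka_rotated, cka_collapsed


if __name__ == "__main__":
    print(demo())
```

In a real pipeline, the activation matrices would come from the same layer of the guard model before and after fine-tuning, evaluated on the same prompts. Comparing the pre- and post-fine-tuning subspaces, or computing CKA between the two sets of activations, is one way to track the kind of boundary drift the abstract describes.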
