Beyond Red-Teaming: Formal Guarantees of LLM Guardrail Classifiers
Nikita Kezins, Urbas Ekka, Pascal Berrang, Luca Arnaboldi

TL;DR
This paper introduces a formal verification framework for LLM guardrail classifiers by shifting verification to the classifier's pre-activation space, enabling sound, scalable guarantees of safety.
Contribution
It presents a novel approach to formally verify guardrail classifiers using convex regions in representation space, with exact and probabilistic certificates, revealing safety gaps in existing models.
Findings
Hyper-rectangle certificates expose safety holes in all tested classifiers.
GMM certificates reveal structural stability differences among models.
BERT's safety coverage is highly volatile, requiring conservative thresholds.
Abstract
Guardrail Classifiers defend production language models against harmful behavior, but although results seem promising in testing, they provide no formal guarantees. Providing formal guarantees for such models is hard because "harmful behavior" has no natural specification in a discrete input space: and the standard epsilon-ball properties used in other domains do not carry semantic meaning. We close this gap by shifting verification from the discrete input space to the classifier's pre-activation space, where we define a harmful region as a convex shape enclosing the representations of known harmful prompts. Because the sigmoid classification head is monotonic, certifying the worst-case point is sufficient to certify the entire region, yielding a closed-form soundness proof without approximation in O(d) time. To formally evaluate these classifiers, we propose two constructions of such…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
