SoK: Evaluating Jailbreak Guardrails for Large Language Models
Xunguang Wang, Zhenlan Ji, Wenxuan Wang, Zongjie Li, Daoyuan Wu, Shuai Wang

TL;DR
This paper systematically analyzes jailbreak guardrails for large language models, proposing a comprehensive taxonomy and evaluation framework to assess their effectiveness, limitations, and universality across attack types.
Contribution
It introduces the first holistic taxonomy and evaluation framework for LLM guardrails, providing structured insights into their strengths and weaknesses.
Findings
Identifies key dimensions of guardrail effectiveness
Highlights limitations in current guardrail approaches
Provides recommendations for optimizing defense mechanisms
Abstract
Large Language Models (LLMs) have achieved remarkable progress, but their deployment has exposed critical vulnerabilities, particularly to jailbreak attacks that circumvent safety alignments. Guardrails--external defense mechanisms that monitor and control LLM interactions--have emerged as a promising solution. However, the current landscape of LLM guardrails is fragmented, lacking a unified taxonomy and comprehensive evaluation framework. In this Systematization of Knowledge (SoK) paper, we present the first holistic analysis of jailbreak guardrails for LLMs. We propose a novel, multi-dimensional taxonomy that categorizes guardrails along six key dimensions, and introduce a Security-Efficiency-Utility evaluation framework to assess their practical effectiveness. Through extensive analysis and experiments, we identify the strengths and limitations of existing guardrail approaches,…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsAdversarial Robustness in Machine Learning · Topic Modeling · Advanced Malware Detection Techniques
