SoK: Evaluating Jailbreak Guardrails for Large Language Models

Xunguang Wang; Zhenlan Ji; Wenxuan Wang; Zongjie Li; Daoyuan Wu; Shuai Wang

arXiv:2506.10597·cs.CR·October 17, 2025

TL;DR

This paper systematically analyzes jailbreak guardrails for large language models, proposing a comprehensive taxonomy and evaluation framework to assess their effectiveness, limitations, and universality across attack types.

Contribution

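The paper introduces what the authors describe as the first holistic taxonomy and evaluation framework for LLM jailbreak guardrails: a taxonomy spanning six dimensions and a Security-Efficiency-Utility evaluation framework, giving a structured view of guardrail strengths and weaknesses.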

Findings

1. Identifies key dimensions of guardrail effectiveness
2. Highlights limitations in current guardrail approaches
3. Provides recommendations for optimizing defense mechanisms

Abstract

Large Language Models (LLMs) have achieved remarkable progress, but their deployment has exposed critical vulnerabilities, particularly to jailbreak attacks that circumvent safety alignment. Guardrails, external defense mechanisms that monitor and control LLM interactions, have emerged as a promising solution. However, the current landscape of LLM guardrails is fragmented, lacking a unified taxonomy and comprehensive evaluation framework. In this Systematization of Knowledge (SoK) paper, we present the first holistic analysis of jailbreak guardrails for LLMs. We propose a novel, multi-dimensional taxonomy that categorizes guardrails along six key dimensions, and introduce a Security-Efficiency-Utility evaluation framework to assess their practical effectiveness. Through extensive analysis and experiments, we identify the strengths and limitations of existing guardrail approaches,…

Code & Models

Repositories

xunguangwang/sok4jailbreakguardrails (PyTorch, official)

Datasets

xunguangwang/JailbreakGuardrailBenchmark (dataset, 38 downloads)

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Topic Modeling · Advanced Malware Detection Techniques