SoK: Robustness in Large Language Models against Jailbreak Attacks

Feiyue Xu; Hongsheng Hu; Chaoxiang He; Sheng Hang; Hanqing Hu; Xiuming Liu; Yubo Zhao; Zhengyan Zhou; Bin Benjamin Zhu; Shi-Feng Sun; Dawu Gu; Shuo Wang

arXiv:2605.05058·cs.CR·May 7, 2026

TL;DR

This paper presents a systematic taxonomy of jailbreak attacks on large language models and of the defenses against them, introduces Security Cube, a unified multi-dimensional evaluation framework, and benchmarks existing attacks and defenses to identify key challenges and future directions.

Contribution

It presents a unified, multi-dimensional framework for evaluating LLM security against jailbreaks and provides benchmark studies on numerous attacks and defenses.

Findings

1. Benchmarking reveals strengths and weaknesses of current defenses.
2. Identifies open challenges in LLM robustness and interpretability.
3. Provides a comprehensive taxonomy and evaluation framework.

Abstract

Large Language Models (LLMs) have achieved remarkable success but remain highly susceptible to jailbreak attacks, in which adversarial prompts coerce models into generating harmful, unethical, or policy-violating outputs. Such attacks pose real-world risks, eroding safety, trust, and regulatory compliance in high-stakes applications. Although a variety of attack and defense methods have been proposed, existing evaluation practices are inadequate, often relying on narrow metrics like attack success rate that fail to capture the multidimensional nature of LLM security. In this paper, we present a systematic taxonomy of jailbreak attacks and defenses and introduce Security Cube, a unified, multi-dimensional framework for comprehensive evaluation of these techniques. We provide detailed comparison tables of existing attacks and defenses, highlighting key insights and open challenges across…
