TL;DR
This paper systematically reviews jailbreak attacks on large language models, introduces a comprehensive evaluation framework called Security Cube, and benchmarks existing attacks and defenses to identify key challenges and future directions.
Contribution
It presents a unified, multi-dimensional framework for evaluating LLM security against jailbreaks and provides benchmark studies on numerous attacks and defenses.
Findings
Benchmarking reveals strengths and weaknesses of current defenses.
Identifies open challenges in LLM robustness and interpretability.
Provides a comprehensive taxonomy and evaluation framework.
Abstract
Large Language Models (LLMs) have achieved remarkable success but remain highly susceptible to jailbreak attacks, in which adversarial prompts coerce models into generating harmful, unethical, or policy-violating outputs. Such attacks pose real-world risks, eroding safety, trust, and regulatory compliance in high-stakes applications. Although a variety of attack and defense methods have been proposed, existing evaluation practices are inadequate, often relying on narrow metrics like attack success rate that fail to capture the multidimensional nature of LLM security. In this paper, we present a systematic taxonomy of jailbreak attacks and defenses and introduce Security Cube, a unified, multi-dimensional framework for comprehensive evaluation of these techniques. We provide detailed comparison tables of existing attacks and defenses, highlighting key insights and open challenges across…
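As a rough illustration of why a single attack success rate (ASR) can be a narrow metric, the sketch below computes ASR as successful attempts divided by total attempts, first overall and then per attack group. It is a minimal sketch and not the Security Cube framework: the function names, the group labels ("roleplay", "encoding"), and the outcome data are all hypothetical and invented for this example.

```python
from collections import defaultdict


def attack_success_rate(outcomes):
    """Return the fraction of attempts judged successful.

    `outcomes` is a non-empty list of booleans, one per attack attempt.
    """
    if not outcomes:
        raise ValueError("need at least one attempt")
    return sum(outcomes) / len(outcomes)


def asr_by_group(records):
    """Compute ASR separately for each group label.

    `records` is an iterable of (group_label, succeeded) pairs.
    """
    grouped = defaultdict(list)
    for group, succeeded in records:
        grouped[group].append(succeeded)
    return {group: attack_success_rate(vals) for group, vals in grouped.items()}


# Hypothetical outcomes, made up purely for illustration.
records = [
    ("roleplay", True), ("roleplay", True), ("roleplay", False), ("roleplay", True),
    ("encoding", False), ("encoding", False), ("encoding", True), ("encoding", False),
]

overall = attack_success_rate([ok for _, ok in records])
per_group = asr_by_group(records)

print(f"overall ASR: {overall:.2f}")
for group, rate in per_group.items():
    print(f"{group}: {rate:.2f}")
```

With this made-up data the aggregate figure sits between two very different per-group rates, which is one simple way a single number can hide structure that a multi-dimensional evaluation would surface.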
