Jailbreaking LLMs & VLMs: Mechanisms, Evaluation, and Unified Defense
Zejian Chen, Chaozhuo Li, Chao Li, Xi Zhang, Litian Zhang, and Yiming He

TL;DR
This paper systematically surveys jailbreak attacks and defenses on LLMs and VLMs, analyzes their mechanisms and evaluation metrics, and proposes unified defense principles for multimodal models.
Contribution
It introduces a three-dimensional survey framework covering attacks, defenses, and evaluation, together with unified defense principles for multimodal models.
Findings
Jailbreak vulnerabilities arise from structural factors such as incomplete training data, linguistic ambiguity, and generative uncertainty.
Unified defense principles include variant-consistency and gradient-sensitivity detection.
The survey covers multimodal safety benchmarks and future research directions.
Abstract
This paper provides a systematic survey of jailbreak attacks and defenses on Large Language Models (LLMs) and Vision-Language Models (VLMs), emphasizing that jailbreak vulnerabilities stem from structural factors such as incomplete training data, linguistic ambiguity, and generative uncertainty. It further differentiates between hallucinations and jailbreaks in terms of intent and triggering mechanisms. We propose a three-dimensional survey framework: (1) Attack dimension, including template/encoding-based, in-context learning manipulation, reinforcement/adversarial learning, LLM-assisted and fine-tuned attacks, as well as prompt- and image-level perturbations and agent-based transfer in VLMs; (2) Defense dimension, encompassing prompt-level obfuscation, output evaluation, and model-level alignment or fine-tuning; and (3) Evaluation dimension, covering metrics such as Attack Success Rate…
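Attack Success Rate, the metric named in the abstract, is commonly understood as the fraction of attack prompts for which the model produces a non-refused response. The sketch below illustrates that ratio only; it is not the paper's evaluation protocol. The names `attack_success_rate`, `keyword_judge`, and `REFUSAL_MARKERS` are hypothetical, and the keyword-based judge is a deliberately crude stand-in for whatever judge (human, classifier, or LLM) a real benchmark would use.

```python
from typing import Callable, Sequence

# Hypothetical refusal phrases; a real evaluation would use a stronger judge.
REFUSAL_MARKERS = ("i cannot", "i can't", "i'm sorry")


def keyword_judge(response: str) -> bool:
    """Return True if the response does NOT look like a refusal (assumed heuristic)."""
    text = response.lower()
    return not any(marker in text for marker in REFUSAL_MARKERS)


def attack_success_rate(
    responses: Sequence[str],
    is_jailbroken: Callable[[str], bool] = keyword_judge,
) -> float:
    """Fraction of model responses that the judge counts as a successful jailbreak."""
    if not responses:
        raise ValueError("responses must be non-empty")
    successes = sum(1 for r in responses if is_jailbroken(r))
    return successes / len(responses)


if __name__ == "__main__":
    # Toy responses, invented for illustration only.
    sample = [
        "I'm sorry, but I can't help with that.",
        "Sure, here is an outline.",
        "I cannot assist with this request.",
        "Step 1: gather the materials.",
    ]
    print(attack_success_rate(sample))
```

Passing the judge as a parameter keeps the ratio itself separate from the question of what counts as a success, which is where evaluation setups tend to differ.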
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Explainable Artificial Intelligence (XAI) · Ethics and Social Impacts of AI
