CAVGAN: Unifying Jailbreak and Defense of LLMs via Generative Adversarial Attacks on their Internal Representations
Xiaohu Li, Yunfeng Ning, Zepeng Bao, Mayi Xu, Jianhao Chen, Tieyun Qian

TL;DR
This paper introduces CAVGAN, a unified framework using generative adversarial attacks on internal representations of LLMs to improve both jailbreak attack success and defense effectiveness, revealing internal security mechanisms.
Contribution
We propose a novel GAN-based method that unifies jailbreak attacks and defenses by exploiting the linear separability of LLM internal embeddings.
Findings
Achieves 88.85% jailbreak success rate across three LLMs
Reaches 84.17% defense success rate on jailbreak datasets
Provides new insights into LLM internal security mechanisms
Abstract
Security alignment enables the Large Language Model (LLM) to gain the protection against malicious queries, but various jailbreak attack methods reveal the vulnerability of this security mechanism. Previous studies have isolated LLM jailbreak attacks and defenses. We analyze the security protection mechanism of the LLM, and propose a framework that combines attack and defense. Our method is based on the linearly separable property of LLM intermediate layer embedding, as well as the essence of jailbreak attack, which aims to embed harmful problems and transfer them to the safe area. We utilize generative adversarial network (GAN) to learn the security judgment boundary inside the LLM to achieve efficient jailbreak attack and defense. The experimental results indicate that our method achieves an average jailbreak success rate of 88.85\% across three popular LLMs, while the defense success…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsAdversarial Robustness in Machine Learning · Privacy-Preserving Technologies in Data · Advanced Graph Neural Networks
