CAVGAN: Unifying Jailbreak and Defense of LLMs via Generative Adversarial Attacks on their Internal Representations

Xiaohu Li; Yunfeng Ning; Zepeng Bao; Mayi Xu; Jianhao Chen; Tieyun Qian

arXiv:2507.06043·cs.CR·August 7, 2025

TL;DR

This paper introduces CAVGAN, a unified framework using generative adversarial attacks on internal representations of LLMs to improve both jailbreak attack success and defense effectiveness, revealing internal security mechanisms.

Contribution

We propose a novel GAN-based method that unifies jailbreak attacks and defenses by exploiting the linear separability of LLM internal embeddings.

Findings

01 Achieves an average jailbreak success rate of 88.85% across three popular LLMs

02 Reaches a defense success rate of 84.17% on jailbreak datasets

03 Provides new insights into LLM internal security mechanisms

Abstract

Security alignment gives a Large Language Model (LLM) protection against malicious queries, but a variety of jailbreak attack methods expose the vulnerability of this security mechanism. Previous studies have treated LLM jailbreak attacks and defenses in isolation. We analyze the security protection mechanism of the LLM and propose a framework that combines attack and defense. Our method builds on the linear separability of LLM intermediate-layer embeddings and on the essence of a jailbreak attack, which is to shift the embedding of a harmful query into the safe region. We use a generative adversarial network (GAN) to learn the security judgment boundary inside the LLM, enabling efficient jailbreak attack and defense. The experimental results indicate that our method achieves an average jailbreak success rate of 88.85% across three popular LLMs, while the defense success…
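To make the idea of a linearly separable "security boundary" concrete, here is a minimal sketch in Python. It is not the paper's implementation. It assumes synthetic Gaussian vectors as stand-ins for intermediate-layer embeddings. It replaces the GAN with a plain logistic-regression probe and a closed-form linear shift. All names (`benign`, `harmful`, `push_to_benign`, `margin`) are hypothetical and chosen for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n = 16, 200

# ASSUMPTION: synthetic stand-ins for intermediate-layer embeddings.
# Real embeddings would be read from a hidden layer of an LLM.
direction = rng.normal(size=dim)
direction /= np.linalg.norm(direction)
benign = rng.normal(size=(n, dim)) + 3.0 * direction
harmful = rng.normal(size=(n, dim)) - 3.0 * direction

X = np.vstack([benign, harmful])
y = np.concatenate([np.zeros(n), np.ones(n)])  # 1 = harmful


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


# Linear probe (logistic regression) trained with plain gradient descent.
# It plays the role of a learned harmful/benign boundary; it is NOT a GAN.
w = np.zeros(dim)
b = 0.0
for _ in range(500):
    p = sigmoid(X @ w + b)
    w -= 0.5 * (X.T @ (p - y)) / len(y)
    b -= 0.5 * np.mean(p - y)

probe_accuracy = np.mean((sigmoid(X @ w + b) > 0.5) == (y == 1))


def push_to_benign(e, margin=1.0):
    """Shift e along -w by the smallest L2 step that sets its logit to -margin."""
    logit = e @ w + b
    if logit <= -margin:
        return e
    step = (logit + margin) / (w @ w)
    return e - step * w


# "Attack" direction: move harmful embeddings onto the benign side of the probe.
shifted = np.array([push_to_benign(e) for e in harmful])
flipped_fraction = np.mean(sigmoid(shifted @ w + b) < 0.5)

print(f"probe accuracy: {probe_accuracy:.3f}")
print(f"harmful embeddings classified benign after shift: {flipped_fraction:.3f}")
```

In this toy setting, the same probe serves both directions: shifting an embedding across the boundary mimics the attack, while flagging embeddings that the probe scores as harmful mimics a defense. How the paper's generator and discriminator actually realize these two roles is not specified in the text above, so this correspondence is only an analogy.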

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Privacy-Preserving Technologies in Data · Advanced Graph Neural Networks