LARGO: Latent Adversarial Reflection through Gradient Optimization for Jailbreaking LLMs
Ran Li, Hao Wang, Chengzhi Mao

TL;DR
LARGO introduces a gradient-based latent space attack that efficiently generates stealthy jailbreaking prompts, outperforming existing methods in success rate and demonstrating the potential of internal LLM manipulation.
Contribution
It presents a novel latent self-reflection attack leveraging gradient optimization within the LLM's latent space, advancing the effectiveness of adversarial prompt generation.
Findings
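LARGO surpasses leading jailbreaking techniques, including AutoDAN, by 44 points in attack success rate on standard benchmarks such as AdvBench and JailbreakBench.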
The method is fast, effective, and transferable across benchmarks.
It demonstrates the potential of internal LLM manipulation through gradient-based attacks.
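To make the "44 points" figure easier to read, here is a minimal sketch of how an attack success rate (ASR) and a point gap between two methods could be computed. It assumes "points" means percentage points, which the source does not spell out. The function names and outcome lists are hypothetical and purely illustrative; they are not data from the paper.

```python
def attack_success_rate(outcomes):
    """Return the percentage of attempts marked successful (True)."""
    if not outcomes:
        raise ValueError("outcomes must be non-empty")
    return 100.0 * sum(outcomes) / len(outcomes)


def point_gap(asr_a, asr_b):
    """Absolute difference in percentage points (not a relative change)."""
    return asr_a - asr_b


# Hypothetical judge verdicts for two methods on the same 10 prompts.
baseline_outcomes = [True] * 3 + [False] * 7
method_outcomes = [True] * 7 + [False] * 3

baseline_asr = attack_success_rate(baseline_outcomes)  # 30.0
method_asr = attack_success_rate(method_outcomes)      # 70.0
gap = point_gap(method_asr, baseline_asr)              # 40.0 points
```

A gap in points is an absolute difference between two rates. In the made-up example above, the relative improvement would be much larger than 40 percent, which is why the two readings should not be confused.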
Abstract
Efficient red-teaming methods for uncovering vulnerabilities in Large Language Models (LLMs) are crucial. While recent attacks often use LLMs as optimizers, the discrete language space makes gradient-based methods struggle. We introduce LARGO (Latent Adversarial Reflection through Gradient Optimization), a novel latent self-reflection attack that reasserts the power of gradient-based optimization for generating fluent jailbreaking prompts. By operating within the LLM's continuous latent space, LARGO first optimizes an adversarial latent vector and then recursively calls the same LLM to decode the latent into natural language. This methodology yields a fast, effective, and transferable attack that produces fluent and stealthy prompts. On standard benchmarks like AdvBench and JailbreakBench, LARGO surpasses leading jailbreaking techniques, including AutoDAN, by 44 points in attack success rate.…
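The abstract's point about continuous versus discrete spaces can be illustrated with a toy example. The sketch below is not LARGO and involves no language model: it only shows that when the variable being optimized is a continuous vector, a differentiable objective yields a gradient that can be followed directly. The frozen weight vector `w`, the target value, and the function names are all assumptions made for illustration.

```python
def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def optimize_latent(w, target, z0, lr=0.1, steps=200):
    """Gradient descent on z for the toy loss (w.z - target)^2, with w frozen."""
    z = list(z0)
    for _ in range(steps):
        err = dot(w, z) - target
        # Gradient of (w.z - target)^2 with respect to z is 2 * err * w.
        z = [zi - lr * 2.0 * err * wi for zi, wi in zip(z, w)]
    return z


# Toy frozen "model": a fixed linear scorer (hypothetical values).
w = [0.5, -1.0, 2.0]
z = optimize_latent(w, target=3.0, z0=[0.0, 0.0, 0.0])
final_loss = (dot(w, z) - 3.0) ** 2
```

With discrete tokens there is no such small step to take: a candidate can only be swapped for another member of a finite vocabulary, which is the difficulty the abstract attributes to gradient-based methods in language space.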
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Hate Speech and Cyberbullying Detection · Topic Modeling
