LARGO: Latent Adversarial Reflection through Gradient Optimization for Jailbreaking LLMs

Ran Li; Hao Wang; Chengzhi Mao

arXiv:2505.10838 · cs.LG · May 19, 2025

TL;DR

LARGO introduces a gradient-based latent space attack that efficiently generates stealthy jailbreaking prompts, outperforming existing methods in success rate and demonstrating the potential of internal LLM manipulation.

Contribution

The paper presents a latent self-reflection attack that applies gradient optimization within the LLM's continuous latent space, then has the same LLM decode the optimized latent into a fluent adversarial prompt.

Findings

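01 LARGO surpasses leading jailbreaking techniques, including AutoDAN, by 44 points in attack success rate on AdvBench and JailbreakBench.

02 The attack is fast, effective, and transferable, and the prompts it produces are fluent and stealthy.

03 The results demonstrate the potential of manipulating an LLM's internal representations through gradient-based attacks.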

Abstract

Efficient red-teaming methods to uncover vulnerabilities in Large Language Models (LLMs) are crucial. While recent attacks often use LLMs as optimizers, the discrete language space makes gradient-based methods struggle. We introduce LARGO (Latent Adversarial Reflection through Gradient Optimization), a novel latent self-reflection attack that reasserts the power of gradient-based optimization for generating fluent jailbreaking prompts. By operating within the LLM's continuous latent space, LARGO first optimizes an adversarial latent vector and then recursively calls the same LLM to decode the latent into natural language. This methodology yields a fast, effective, and transferable attack that produces fluent and stealthy prompts. On standard benchmarks like AdvBench and JailbreakBench, LARGO surpasses leading jailbreaking techniques, including AutoDAN, by 44 points in attack success rate.…

Videos

LARGO: Latent Adversarial Reflection through Gradient Optimization for Jailbreaking LLMs · SlidesLive

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Hate Speech and Cyberbullying Detection · Topic Modeling