Iterative Self-Tuning LLMs for Enhanced Jailbreaking Capabilities
Chung-En Sun, Xiaodong Liu, Weiwei Yang, Tsui-Wei Weng, Hao Cheng, Aidan San, Michel Galley, Jianfeng Gao

TL;DR
This paper introduces ADV-LLM, an iterative self-tuning method that trains an LLM to generate adversarial suffixes that bypass safety alignment. It achieves nearly 100% attack success rate (ASR) on open-source LLMs at reduced computational cost, and its attacks transfer to closed-source models.
Contribution
The paper presents a novel self-tuning framework that improves jailbreak attack success rates and efficiency, providing a new tool for safety research and model robustness testing.
Findings
Achieves nearly 100% ASR on open-source LLMs
Attains 99% ASR on GPT-3.5 and 49% ASR on GPT-4 via transfer, despite being optimized solely on Llama3
Reduces computational cost of adversarial suffix generation
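The ASR figures above are percentages of attack attempts counted as successful. A minimal sketch of that arithmetic follows. It assumes each attempt has already been labeled success or failure by some judge. The function name and the sample outcomes are hypothetical and are not taken from the paper, which may score success differently.

```python
from typing import Sequence


def attack_success_rate(outcomes: Sequence[bool]) -> float:
    """Return the percentage of attempts labeled successful.

    `outcomes` holds one boolean per attempt (True = judged successful).
    How an attempt is judged is an assumption outside this sketch.
    """
    if not outcomes:
        raise ValueError("outcomes must be non-empty")
    # True counts as 1 and False as 0, so sum() gives the success count.
    return 100.0 * sum(outcomes) / len(outcomes)


# Made-up labels for four attempts, for illustration only.
sample = [True, True, False, True]
print(f"ASR: {attack_success_rate(sample):.1f}%")  # prints "ASR: 75.0%"
```

Reported ASR values depend heavily on how success is judged, so figures from different evaluation setups are not directly comparable.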
Abstract
Recent research has shown that Large Language Models (LLMs) are vulnerable to automated jailbreak attacks, where adversarial suffixes crafted by algorithms appended to harmful queries bypass safety alignment and trigger unintended responses. Current methods for generating these suffixes are computationally expensive and have low Attack Success Rates (ASR), especially against well-aligned models like Llama2 and Llama3. To overcome these limitations, we introduce ADV-LLM, an iterative self-tuning process that crafts adversarial LLMs with enhanced jailbreak ability. Our framework significantly reduces the computational cost of generating adversarial suffixes while achieving nearly 100% ASR on various open-source LLMs. Moreover, it exhibits strong attack transferability to closed-source models, achieving 99% ASR on GPT-3.5 and 49% ASR on GPT-4, despite being optimized solely on Llama3.…
Taxonomy
Topics: Digital and Cyber Forensics · Adversarial Robustness in Machine Learning · Handwritten Text Recognition Techniques
Methods: Label Smoothing · Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Cosine Annealing · Transformer · Multi-Head Attention
