Iterative Self-Tuning LLMs for Enhanced Jailbreaking Capabilities

Chung-En Sun; Xiaodong Liu; Weiwei Yang; Tsui-Wei Weng; Hao Cheng; Aidan San; Michel Galley; Jianfeng Gao

arXiv:2410.18469 · cs.CL · November 10, 2025

TL;DR

This paper introduces ADV-LLM, an iterative self-tuning process that trains adversarial LLMs to generate jailbreak suffixes. It achieves nearly 100% attack success rate (ASR) on open-source LLMs at reduced computational cost, and its attacks transfer to closed-source models (99% ASR on GPT-3.5 and 49% on GPT-4).

Contribution

The paper presents a novel self-tuning framework that improves jailbreak attack success rates and efficiency, providing a new tool for safety research and model robustness testing.

Findings

01. Achieves nearly 100% ASR on open-source LLMs
02. Attains 99% ASR on GPT-3.5 and 49% on GPT-4
03. Reduces computational cost of adversarial suffix generation

Abstract

Recent research has shown that Large Language Models (LLMs) are vulnerable to automated jailbreak attacks, where adversarial suffixes crafted by algorithms appended to harmful queries bypass safety alignment and trigger unintended responses. Current methods for generating these suffixes are computationally expensive and have low Attack Success Rates (ASR), especially against well-aligned models like Llama2 and Llama3. To overcome these limitations, we introduce ADV-LLM, an iterative self-tuning process that crafts adversarial LLMs with enhanced jailbreak ability. Our framework significantly reduces the computational cost of generating adversarial suffixes while achieving nearly 100% ASR on various open-source LLMs. Moreover, it exhibits strong attack transferability to closed-source models, achieving 99% ASR on GPT-3.5 and 49% ASR on GPT-4, despite being optimized solely on Llama3.…

Code & Models

Repositories

sunchungen/adv-llm (PyTorch, official)

Videos

Iterative Self-Tuning LLMs for Enhanced Jailbreaking Capabilities (Underline)

Taxonomy

Topics: Digital and Cyber Forensics · Adversarial Robustness in Machine Learning · Handwritten Text Recognition Techniques

Methods: Label Smoothing · Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Cosine Annealing · Transformer · Multi-Head Attention