KDA: A Knowledge-Distilled Attacker for Generating Diverse Prompts to Jailbreak LLMs

Buyun Liang; Kwan Ho Ryan Chan; Darshan Thaker; Jinqi Luo; René Vidal

arXiv:2502.05223·cs.CR·February 11, 2025

TL;DR

This paper introduces KDA, a knowledge-distilled attacker that automatically generates diverse prompts to effectively jailbreak large language models, reducing reliance on manual prompt engineering and improving attack success rates.

Contribution

The paper presents KDA, a novel knowledge-distilled model that efficiently produces diverse attack prompts, outperforming existing methods in success rate and cost-effectiveness.

Findings

01. KDA achieves higher attack success rates than baseline methods.

02. KDA generates more diverse prompts, enhancing attack effectiveness.

03. KDA is more cost- and time-efficient for large-scale red-teaming.

Abstract

Jailbreak attacks exploit specific prompts to bypass LLM safeguards, causing the LLM to generate harmful, inappropriate, and misaligned content. Current jailbreaking methods rely heavily on carefully designed system prompts and numerous queries to achieve a single successful attack, which is costly and impractical for large-scale red-teaming. To address this challenge, we propose to distill the knowledge of an ensemble of SOTA attackers into a single open-source model, called Knowledge-Distilled Attacker (KDA), which is finetuned to automatically generate coherent and diverse attack prompts without the need for meticulous system prompt engineering. Compared to existing attackers, KDA achieves higher attack success rates and greater cost-time efficiency when targeting multiple SOTA open-source and commercial black-box LLMs. Furthermore, we conducted a quantitative diversity analysis of…

Taxonomy

Topics: Digital and Cyber Forensics · Cybercrime and Law Enforcement Studies · Network Security and Intrusion Detection