KDA: A Knowledge-Distilled Attacker for Generating Diverse Prompts to Jailbreak LLMs
Buyun Liang, Kwan Ho Ryan Chan, Darshan Thaker, Jinqi Luo, René Vidal

TL;DR
This paper introduces KDA, a knowledge-distilled attacker that automatically generates diverse prompts to effectively jailbreak large language models, reducing reliance on manual prompt engineering and improving attack success rates.
Contribution
The paper presents KDA, a novel knowledge-distilled model that efficiently produces diverse attack prompts, outperforming existing methods in success rate and cost-effectiveness.
Findings
KDA achieves higher attack success rates than baseline methods.
KDA generates more diverse prompts, enhancing attack effectiveness.
KDA is more cost- and time-efficient than existing attackers, making it more practical for large-scale red-teaming.
Abstract
Jailbreak attacks exploit specific prompts to bypass LLM safeguards, causing the LLM to generate harmful, inappropriate, and misaligned content. Current jailbreaking methods rely heavily on carefully designed system prompts and numerous queries to achieve a single successful attack, which is costly and impractical for large-scale red-teaming. To address this challenge, we propose to distill the knowledge of an ensemble of SOTA attackers into a single open-source model, called Knowledge-Distilled Attacker (KDA), which is finetuned to automatically generate coherent and diverse attack prompts without the need for meticulous system prompt engineering. Compared to existing attackers, KDA achieves higher attack success rates and greater cost-time efficiency when targeting multiple SOTA open-source and commercial black-box LLMs. Furthermore, we conducted a quantitative diversity analysis of…
Taxonomy
Topics: Digital and Cyber Forensics · Cybercrime and Law Enforcement Studies · Network Security and Intrusion Detection
