WildTeaming at Scale: From In-the-Wild Jailbreaks to (Adversarially) Safer Language Models
Liwei Jiang, Kavel Rao, Seungju Han, Allyson Ettinger, Faeze Brahman, Sachin Kumar, Niloofar Mireshghallah, Ximing Lu, Maarten Sap, Yejin Choi, Nouha Dziri

TL;DR
WildTeaming is an automated red-teaming framework that mines in-the-wild user-chatbot interactions to discover new jailbreak tactics. The authors use it to build a large open-source safety training dataset and to study LLM vulnerabilities and the effects of safety training.
Contribution
The paper introduces WildTeaming for automatic discovery of jailbreak tactics, creates the WildJailbreak dataset, and analyzes the effects of safety training on language models.
Findings
WildTeaming discovers 5.7K unique clusters of jailbreak tactics.
WildJailbreak contains 262K prompt-response pairs for safety training.
Models trained with WildJailbreak achieve balanced safety behaviors: appropriate safeguarding without over-refusal.
Abstract
We introduce WildTeaming, an automatic LLM safety red-teaming framework that mines in-the-wild user-chatbot interactions to discover 5.7K unique clusters of novel jailbreak tactics, and then composes multiple tactics for systematic exploration of novel jailbreaks. Compared to prior work that performed red-teaming via recruited human workers, gradient-based optimization, or iterative revision with LLMs, our work investigates jailbreaks from chatbot users who were not specifically instructed to break the system. WildTeaming reveals previously unidentified vulnerabilities of frontier LLMs, resulting in up to 4.6x more diverse and successful adversarial attacks compared to state-of-the-art jailbreak methods. While many datasets exist for jailbreak evaluation, very few open-source datasets exist for jailbreak training, as safety training data has been closed even when model weights are…
Code & Models
- 🤗 allenai/llama2-7b-WildJailbreak · model
- 🤗 allenai/llama2-13b-WildJailbreak · model · ♡ 1
- 🤗 larenspear/copy_of_wildjailbreak · model · 3 dl
- 🤗 larenspear/copy_of_wildjailbreak_13 · model · 9 dl
- 🤗 iknow-lab/llama-3.2-3B-wildguard-ko-2410 · model · 23 dl · ♡ 4
- 🤗 RichardErkhov/iknow-lab_-_llama-3.2-3B-wildguard-ko-2410-gguf · model · 417 dl
- 🤗 RichardErkhov/iknow-lab_-_llama-3.2-3B-wildguard-ko-2410-4bits · model
- 🤗 RichardErkhov/iknow-lab_-_llama-3.2-3B-wildguard-ko-2410-8bits · model
- 🤗 hfuserh/LLaMA-3.1-8B-JailbreakSafe · model
- 🤗 0dinai/jailbreak-embeddings-base-onnx · model · 21 dl
Taxonomy
Topics: Adversarial Robustness in Machine Learning
