Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming

Mrinank Sharma; Meg Tong; Jesse Mu; Jerry Wei; Jorrit Kruthoff; Scott Goodfriend; Euan Ong; Alwin Peng; Raj Agarwal; Cem Anil; Amanda Askell; Nathan Bailey; Joe Benton; Emma Bluemke; Samuel R. Bowman; Eric Christiansen; Hoagy Cunningham; Andy Dau; Anjali Gopal; Rob Gilson; Logan Graham; Logan Howard; Nimit Kalra; Taesung Lee; Kevin Lin; Peter Lofgren; Francesco Mosconi; Clare O'Hara; Catherine Olsson; Linda Petrini; Samir Rajani; Nikhil Saxena; Alex Silverstein; Tanya Singh; Theodore Sumers; Leonard Tang; Kevin K. Troy; Constantin Weisser; Ruiqi Zhong; Giulio Zhou; Jan Leike; Jared Kaplan; Ethan Perez

arXiv:2501.18837 · cs.CL · February 3, 2025 · 5 cites

TL;DR

This paper introduces Constitutional Classifiers, a new safeguard method trained on synthetic data to effectively defend large language models against universal jailbreaks, with minimal impact on deployment performance.

Contribution

The paper presents a novel safeguard training approach using synthetic data generated by LLMs with natural language rules, significantly improving defense against universal jailbreaks.

Findings

01 No successful universal jailbreaks found in extensive red teaming

02 Enhanced classifiers resist domain-specific jailbreaks

03 Minimal increase in deployment traffic refusals and inference overhead

Abstract

Large language models (LLMs) are vulnerable to universal jailbreaks: prompting strategies that systematically bypass model safeguards and enable users to carry out harmful processes that require many model interactions, like manufacturing illegal substances at scale. To defend against these attacks, we introduce Constitutional Classifiers: safeguards trained on synthetic data, generated by prompting LLMs with natural language rules (i.e., a constitution) specifying permitted and restricted content. In over 3,000 estimated hours of red teaming, no red teamer found a universal jailbreak that could extract information from an early classifier-guarded LLM at a similar level of detail to an unguarded model across most target queries. On automated evaluations, enhanced classifiers demonstrated robust defense against held-out domain-specific jailbreaks. These classifiers also maintain…
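The two ideas in the abstract, turning a constitution into labeled synthetic-data prompts and wrapping an LLM with classifiers, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the names `Constitution`, `build_generation_prompts`, and `guarded_generate`, the prompt wording, the 0.5 threshold, and the choice to screen both the prompt and the response. The abstract does not describe the authors' actual pipeline or architecture.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

REFUSAL = "[blocked by classifier]"  # hypothetical refusal message


@dataclass(frozen=True)
class Constitution:
    """Natural-language rules: categories of permitted and restricted content."""
    permitted: List[str]
    restricted: List[str]


def build_generation_prompts(constitution: Constitution) -> List[Tuple[str, int]]:
    """Return (prompt, label) pairs to send to a data-generating LLM.

    Label 0 marks a permitted category, label 1 a restricted one. The
    generated texts, paired with these labels, would form the synthetic
    training set for a classifier. The prompt wording is illustrative only.
    """
    pairs: List[Tuple[str, int]] = []
    for label, categories in ((0, constitution.permitted), (1, constitution.restricted)):
        for category in categories:
            prompt = f"Write a user request that falls under this category: {category}"
            pairs.append((prompt, label))
    return pairs


def guarded_generate(
    user_input: str,
    model: Callable[[str], str],
    input_classifier: Callable[[str], float],
    output_classifier: Callable[[str], float],
    threshold: float = 0.5,
) -> str:
    """Call the model only if the input passes, and return the response
    only if the output passes. Classifiers return a score in [0, 1],
    where higher means more likely restricted (an assumed convention)."""
    if input_classifier(user_input) >= threshold:
        return REFUSAL
    response = model(user_input)
    if output_classifier(response) >= threshold:
        return REFUSAL
    return response
```

With a toy keyword classifier standing in for a trained one, `guarded_generate("hello", lambda x: "hi", clf, clf)` returns the model's response, while any input or output the classifier scores at or above the threshold is replaced by the refusal string.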

Code & Models

Models

secllmuser/constitutional-toxic-classifier-gemma (Hugging Face model, 36 downloads)

Taxonomy

Topics: Criminal Law and Evidence · Law, Rights, and Freedoms · Legal Systems and Judicial Processes