Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming
Mrinank Sharma, Meg Tong, Jesse Mu, Jerry Wei, Jorrit Kruthoff, Scott Goodfriend, Euan Ong, Alwin Peng, Raj Agarwal, Cem Anil, Amanda Askell, Nathan Bailey, Joe Benton, Emma Bluemke, Samuel R. Bowman, Eric Christiansen, Hoagy Cunningham, Andy Dau, Anjali Gopal, Rob Gilson

TL;DR
This paper introduces Constitutional Classifiers, a new safeguard method trained on synthetic data to effectively defend large language models against universal jailbreaks, with minimal impact on deployment performance.
Contribution
The paper presents a novel safeguard training approach using synthetic data generated by LLMs with natural language rules, significantly improving defense against universal jailbreaks.
Findings
No successful universal jailbreaks found in extensive red teaming
Enhanced classifiers resist domain-specific jailbreaks
Minimal increase in deployment traffic refusals and inference overhead
Abstract
Large language models (LLMs) are vulnerable to universal jailbreaks: prompting strategies that systematically bypass model safeguards and enable users to carry out harmful processes that require many model interactions, like manufacturing illegal substances at scale. To defend against these attacks, we introduce Constitutional Classifiers: safeguards trained on synthetic data, generated by prompting LLMs with natural language rules (i.e., a constitution) specifying permitted and restricted content. In over 3,000 estimated hours of red teaming, no red teamer found a universal jailbreak that could extract information from an early classifier-guarded LLM at a similar level of detail to an unguarded model across most target queries. On automated evaluations, enhanced classifiers demonstrated robust defense against held-out domain-specific jailbreaks. These classifiers also maintain…
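As a rough illustration of the idea in the abstract, the sketch below shows how a constitution of permitted and restricted categories might be turned into labeled prompts for synthetic data generation, and how a trained classifier could then guard a model. It is not the authors' implementation: the names `Constitution`, `build_generation_prompts`, `guarded_generate`, and `REFUSAL`, the prompt wording, the 0.5 threshold, and the split into an input check and an output check are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical refusal message; not taken from the paper.
REFUSAL = "Sorry, I can't help with that."


@dataclass(frozen=True)
class Constitution:
    """Natural language rules: what is permitted and what is restricted."""
    permitted: List[str]
    restricted: List[str]


def build_generation_prompts(constitution: Constitution) -> List[Tuple[str, int]]:
    """Turn each rule into a (prompt, label) pair for an LLM data generator.

    Label 0 marks permitted content, label 1 marks restricted content.
    The prompt wording is an assumption of this sketch.
    """
    pairs: List[Tuple[str, int]] = []
    for rule in constitution.permitted:
        pairs.append((f"Write a user request in this permitted category: {rule}", 0))
    for rule in constitution.restricted:
        pairs.append((f"Write a user request in this restricted category: {rule}", 1))
    return pairs


def guarded_generate(
    prompt: str,
    model: Callable[[str], str],
    input_score: Callable[[str], float],
    output_score: Callable[[str], float],
    threshold: float = 0.5,
) -> str:
    """Refuse if either the prompt or the model's response scores as restricted."""
    if input_score(prompt) >= threshold:
        return REFUSAL
    response = model(prompt)
    if output_score(response) >= threshold:
        return REFUSAL
    return response
```

In this sketch, `model`, `input_score`, and `output_score` are plain callables, so any LLM client and any classifier trained on the generated data could be plugged in. The threshold controls the trade-off between blocking harmful requests and refusing benign ones.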
Taxonomy
Topics: Criminal Law and Evidence · Law, Rights, and Freedoms · Legal Systems and Judicial Processes
