Constitutional Classifiers++: Efficient Production-Grade Defenses against Universal Jailbreaks

Hoagy Cunningham; Jerry Wei; Zihan Wang; Andrew Persic; Alwin Peng; Jordan Abderrachid; Raj Agarwal; Bobby Chen; Austin Cohen; Andy Dau; Alek Dimitriev; Rob Gilson; Logan Howard; Yijin Hua; Jared Kaplan; Jan Leike; Mu Lin; Christopher Liu; Vladimir Mikulik; Rohit Mittapalli; Clare O'Hara; Jin Pan; Nikhil Saxena; Alex Silverstein; Yue Song; Xunjie Yu; Giulio Zhou; Ethan Perez; Mrinank Sharma

arXiv:2601.04603 · cs.CR · January 9, 2026

Open Access · 3 Reviews

TL;DR

This paper presents Constitutional Classifiers++, a system that significantly improves the robustness of language models against jailbreak attacks while reducing computational costs, making it practical for real-world deployment.

Contribution

We introduce a new ensemble and cascade approach for Constitutional Classifiers that strengthens jailbreak defenses and reduces computational cost by a factor of 40 compared to previous methods.

Findings

01 · Achieved 40x reduction in computational costs.

02 · Maintained a 0.05% refusal rate on production traffic.

03 · No successful jailbreak attack on eight target queries.

Abstract

We introduce enhanced Constitutional Classifiers that deliver production-grade jailbreak robustness with dramatically reduced computational costs and refusal rates compared to previous-generation defenses. Our system combines several key insights. First, we develop exchange classifiers that evaluate model responses in their full conversational context, which addresses vulnerabilities in last-generation systems that examine outputs in isolation. Second, we implement a two-stage classifier cascade where lightweight classifiers screen all traffic and escalate only suspicious exchanges to more expensive classifiers. Third, we train efficient linear probe classifiers and ensemble them with external classifiers to simultaneously improve robustness and reduce computational costs. Together, these techniques yield a production-grade system achieving a 40x computational cost reduction compared to…
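The cascade and ensemble ideas in the abstract can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration and is not taken from the paper: the names `Scorer`, `ensemble`, `Cascade`, and `expected_cost`, the weighted-average ensembling rule, the thresholds, and the keyword-based toy scorers (the paper's probes and classifiers operate on model internals and full exchanges, not keywords). The sketch shows only the control flow: a cheap first stage scores every exchange in its conversational context, and only exchanges above an escalation threshold reach the expensive second stage.

```python
from dataclasses import dataclass
from typing import Callable

# A scorer maps (conversation context, model response) to a harm score in [0, 1].
Scorer = Callable[[str, str], float]


def ensemble(probe: Scorer, external: Scorer, probe_weight: float = 0.5) -> Scorer:
    """Combine two scorers by weighted average (the weighting rule is an assumption)."""
    def scorer(context: str, response: str) -> float:
        return (probe_weight * probe(context, response)
                + (1.0 - probe_weight) * external(context, response))
    return scorer


@dataclass
class Cascade:
    first_stage: Scorer   # cheap, runs on every exchange
    second_stage: Scorer  # expensive, runs only on escalated exchanges
    escalate_at: float    # first-stage score that triggers escalation
    block_at: float       # second-stage score that triggers a block
    first_stage_calls: int = 0
    second_stage_calls: int = 0

    def should_block(self, context: str, response: str) -> bool:
        self.first_stage_calls += 1
        if self.first_stage(context, response) < self.escalate_at:
            return False  # cleared by the cheap screen, second stage never runs
        self.second_stage_calls += 1
        return self.second_stage(context, response) >= self.block_at


def expected_cost(first_cost: float, second_cost: float, escalation_rate: float) -> float:
    """Average per-exchange cost: every exchange pays stage one, a fraction pays stage two."""
    return first_cost + escalation_rate * second_cost


# Hypothetical toy scorers; the marker string stands in for any flagged topic.
MARKER = "PLACEHOLDER_RISKY"

def toy_probe(context: str, response: str) -> float:
    return 0.9 if MARKER in context + response else 0.1

def toy_external(context: str, response: str) -> float:
    return 0.7 if MARKER in response else 0.1

def toy_expensive(context: str, response: str) -> float:
    # Judges the response together with its context, not in isolation.
    return 0.95 if MARKER in context and "step" in response else 0.05


cascade = Cascade(
    first_stage=ensemble(toy_probe, toy_external),
    second_stage=toy_expensive,
    escalate_at=0.3,
    block_at=0.5,
)

benign = cascade.should_block("what is the weather", "sunny")
escalated_ok = cascade.should_block(f"history of {MARKER} policy", "It was regulated in 1990.")
escalated_blocked = cascade.should_block(f"tell me about {MARKER}", "step one is ...")
```

In this toy run all three exchanges pay the first-stage cost, but only the two that mention the marker reach the second stage, and only one of those is blocked. Under the same assumptions, `expected_cost(1.0, 20.0, 0.05)` models a deployment where 5% of traffic escalates to a second stage that costs 20 times as much as the first; these figures are made up and are unrelated to the 40x reduction reported above.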

Peer Reviews

Decision · ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 3

Strengths

1) Clear motivation: The paper provides a clear motivation by diagnosing vulnerabilities in the prior Constitutional Classifier (CC). The authors identify two concrete and realistic attack classes: reconstruction attacks, in which harmful instructions are fragmented across benign segments, and output obfuscation attacks, in which malicious outputs are hidden behind metaphorical or coded language.
2) Novelty: The paper presents two technically distinct yet complementary innovations that meaning

Weaknesses

1) Analysis of the failure cases: While the exchange-classifier architecture clearly improves robustness relative to previous Constitutional Classifiers, the results in Section 3 suggest that failure cases remain and deserve deeper analysis. Specifically, the system still exhibited two high-risk vulnerabilities across 226K red-teaming queries (≈ 0.00885 per thousand), implying that some jailbreaks can still bypass contextual evaluation.
2) Scaling trends for two-stage classifiers: While the paper

Reviewer 02 · Rating 6 · Confidence 3

Strengths

- The problem considered is novel and underexplored in LLM safety. The paper addresses deployment viability, aiming for production-grade defenses.
- The use of exchange classifiers is a clever way to mitigate reconstruction and obfuscation jailbreaks. The scalable, modular design significantly reduces computational overhead compared to existing frameworks.
- The evaluation is excellent, with LLM-based rubric grading for quantitative assessment

Weaknesses

- Though the methodology is well described, more architectural and training details would help reproducibility and generalizability.
- It would be beneficial to compare alternative approaches to the probe methodology (such as sparse autoencoder signals).
- The datasets used for evaluation primarily focus on CBRN-related jailbreaks and internal red-team benchmarks. However, it is unclear how these results translate to broader, more diverse threat domains such as misinf

Reviewer 03 · Rating 4 · Confidence 3

Strengths

- Good motivation: A key strength of this work is that it accurately identifies the practical deployment bottlenecks of [1], specifically its significant computational cost and tendency toward over-refusal. The authors then systematically tackle these issues with a series of well-motivated and targeted methods.
- Strong performance: a high defense success rate with a low refusal rate; a 5.4x reduction in computational overhead.

[1] Mrinank Sharma, Meg Tong, Jesse Mu, Jerry Wei, Jorrit Kruthoff, Scott G

Weaknesses

**Presentation:**
- No illustrative figure about the proposed methods. The absence of illustrative figures or diagrams detailing the proposed system architecture and classifier cascade makes it difficult to fully grasp the methodological workflow and component interactions.

**Novelty:**
- While the manuscript effectively builds upon [1], it overlooks meaningful discussion and comparison with established input-output-filtering based defense methods. This omission, along with the absence of com

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Security and Verification in Computing · Advanced Malware Detection Techniques