Trojan-Speak: Bypassing Constitutional Classifiers with No Jailbreak Tax via Adversarial Finetuning
Bilgehan Sel, Xuanli He, Alwin Peng, Ming Jin, Jerry Wei

TL;DR
Trojan-Speak is an adversarial fine-tuning method that teaches large language models to evade LLM-based content classifiers with less than 5% degradation on reasoning benchmarks, exposing a vulnerability in the safety measures around fine-tuning APIs.
Contribution
The paper introduces Trojan-Speak, an adversarial fine-tuning approach that bypasses Anthropic's Constitutional Classifiers while largely preserving reasoning capability.
Findings
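Achieves 99+% classifier evasion with less than 5% degradation on reasoning benchmarks for models with 14B+ parameters.
Demonstrates that fine-tuned models can provide detailed responses to expert-level CBRN (Chemical, Biological, Radiological, and Nuclear) queries.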
Activation-level probes can improve robustness against such adversarial attacks.
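To make the last finding concrete, here is a minimal sketch of an activation-level probe in general: a small classifier trained on a model's internal activations rather than on its output text. This is not the paper's probe. The activations are synthetic stand-ins (Gaussian vectors with a class-dependent shift), and every name here (`train_probe`, `probe_score`, `synthetic_activations`, `DIM`) is hypothetical and chosen for illustration only.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def probe_score(w, b, x):
    # Probability that activation vector x belongs to the flagged class.
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def train_probe(acts, labels, lr=0.1, epochs=200):
    # Fit a logistic-regression probe with batch gradient descent.
    dim = len(acts[0])
    n = len(acts)
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for x, y in zip(acts, labels):
            err = probe_score(w, b, x) - y
            for i in range(dim):
                grad_w[i] += err * x[i]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def synthetic_activations(n, dim, shift, rng):
    # ASSUMPTION: stand-in data. Real probes would read hidden states
    # from a chosen layer of the model being monitored.
    return [[rng.gauss(shift, 1.0) for _ in range(dim)] for _ in range(n)]


rng = random.Random(0)
DIM = 8  # hypothetical activation width, kept tiny for the example

# Label 0 = benign, label 1 = flagged.
train_acts = (synthetic_activations(100, DIM, -1.0, rng)
              + synthetic_activations(100, DIM, 1.0, rng))
train_labels = [0] * 100 + [1] * 100
w, b = train_probe(train_acts, train_labels)

held_out = (synthetic_activations(50, DIM, -1.0, rng)
            + synthetic_activations(50, DIM, 1.0, rng))
held_labels = [0] * 50 + [1] * 50
correct = sum(
    (probe_score(w, b, x) > 0.5) == bool(y)
    for x, y in zip(held_out, held_labels)
)
accuracy = correct / len(held_out)
```

The design point is that such a probe reads internal state, so it does not depend on the surface form of the generated text, which is what a text-based classifier inspects. How the paper's probes are built and evaluated is not described in this summary.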
Abstract
Fine-tuning APIs offered by major AI providers create new attack surfaces where adversaries can bypass safety measures through targeted fine-tuning. We introduce Trojan-Speak, an adversarial fine-tuning method that bypasses Anthropic's Constitutional Classifiers. Our approach uses curriculum learning combined with GRPO-based hybrid reinforcement learning to teach models a communication protocol that evades LLM-based content classification. Crucially, while prior adversarial fine-tuning approaches report more than 25% capability degradation on reasoning benchmarks, Trojan-Speak incurs less than 5% degradation while achieving 99+% classifier evasion for models with 14B+ parameters. We demonstrate that fine-tuned models can provide detailed responses to expert-level CBRN (Chemical, Biological, Radiological, and Nuclear) queries from Anthropic's Constitutional Classifiers bug-bounty…
