Trojan-Speak: Bypassing Constitutional Classifiers with No Jailbreak Tax via Adversarial Finetuning

Bilgehan Sel; Xuanli He; Alwin Peng; Ming Jin; Jerry Wei

arXiv:2603.29038 · cs.CR · April 1, 2026

TL;DR

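Trojan-Speak is an adversarial fine-tuning method that teaches large language models a communication protocol for evading LLM-based content classifiers, including Anthropic's Constitutional Classifiers, with less than 5% degradation on reasoning benchmarks. It exposes fine-tuning APIs as an attack surface for bypassing safety measures.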

Contribution

The paper introduces Trojan-Speak, an adversarial fine-tuning approach that combines curriculum learning with GRPO-based hybrid reinforcement learning to bypass LLM-based content classifiers while largely preserving reasoning capability.

Findings

01

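Achieves 99+% classifier evasion with less than 5% degradation on reasoning benchmarks for models with 14B+ parameters, where prior adversarial fine-tuning approaches report more than 25% degradation.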

02

Demonstrates that fine-tuned models can provide detailed responses to expert-level CBRN (Chemical, Biological, Radiological, and Nuclear) queries.

03

Activation-level probes can improve robustness against such adversarial attacks.
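
To make the third finding concrete, here is a minimal sketch of what an activation-level probe can look like in general: a linear (logistic-regression) classifier trained on a model's hidden activations rather than on its text output. This is not the paper's implementation. The synthetic data generator, the function names, the dimensions, and the hyperparameters are all assumptions chosen for illustration, and the "activations" are Gaussian vectors rather than real hidden states.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def make_synthetic_activations(n, dim, rng, shift=2.0):
    """Hypothetical stand-in for hidden states: unit Gaussian noise, with
    flagged examples (label 1) shifted along the first coordinate and
    benign examples (label 0) shifted the opposite way."""
    acts, labels = [], []
    for _ in range(n):
        y = rng.randint(0, 1)
        x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        x[0] += shift if y == 1 else -shift
        acts.append(x)
        labels.append(y)
    return acts, labels


def train_linear_probe(acts, labels, lr=0.1, steps=200):
    """Fit a logistic-regression probe with full-batch gradient descent."""
    n, dim = len(acts), len(acts[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(steps):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(acts, labels):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def probe_accuracy(acts, labels, w, b):
    """Fraction of examples where the probe's 0.5-thresholded score matches the label."""
    hits = sum(
        (sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) >= 0.5) == (y == 1)
        for x, y in zip(acts, labels)
    )
    return hits / len(labels)


rng = random.Random(0)
train_x, train_y = make_synthetic_activations(400, 8, rng)
test_x, test_y = make_synthetic_activations(200, 8, rng)
w, b = train_linear_probe(train_x, train_y)
test_acc = probe_accuracy(test_x, test_y, w, b)
print(f"held-out probe accuracy: {test_acc:.3f}")
```

The probe reads an internal representation instead of the surface text, so an encoding applied to the output does not directly change its input. How much this helps against a fine-tuned model is an empirical question that this toy example does not address.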

Abstract

Fine-tuning APIs offered by major AI providers create new attack surfaces where adversaries can bypass safety measures through targeted fine-tuning. We introduce Trojan-Speak, an adversarial fine-tuning method that bypasses Anthropic's Constitutional Classifiers. Our approach uses curriculum learning combined with GRPO-based hybrid reinforcement learning to teach models a communication protocol that evades LLM-based content classification. Crucially, while prior adversarial fine-tuning approaches report more than 25% capability degradation on reasoning benchmarks, Trojan-Speak incurs less than 5% degradation while achieving 99+% classifier evasion for models with 14B+ parameters. We demonstrate that fine-tuned models can provide detailed responses to expert-level CBRN (Chemical, Biological, Radiological, and Nuclear) queries from Anthropic's Constitutional Classifiers bug-bounty…
