The Trojan Example: Jailbreaking LLMs through Template Filling and Unsafety Reasoning
Mingrui Liu, Sixiao Zhang, Cheng Long, Kwok Yan Lam

TL;DR
This paper introduces TrojFill, a black-box attack method that exploits a fundamental flaw in LLM safety alignment by embedding malicious payloads into template structures, successfully bypassing safety filters across multiple commercial models.
Contribution
TrojFill presents a novel template-filling attack framework that effectively bypasses safety filters in commercial LLMs, revealing a systemic vulnerability in current alignment paradigms.
Findings
Achieves near-universal bypass rates on tested models
Outperforms existing black-box attack methods
Generates interpretable and transferable attack vectors
Abstract
As Large Language Models (LLMs) become integral to computing infrastructure, safety alignment serves as the primary security control preventing the generation of harmful payloads. However, this defense remains brittle. Existing jailbreak attacks typically bifurcate into white-box methods, which are inapplicable to commercial APIs due to lack of gradient access, and black-box optimization techniques, which often yield unnatural (e.g., syntactically rigid) or non-transferable (e.g., lacking cross-model generalization) prompts. In this work, we introduce TrojFill, a black-box exploitation framework that bypasses safety filters by targeting a fundamental logic flaw in current alignment paradigms: the decoupling of unsafety reasoning from content generation. TrojFill structurally reframes malicious instructions as a template-filling task required for safety analysis. By embedding obfuscated…
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Advanced Malware Detection Techniques · Security and Verification in Computing
