The Trojan Example: Jailbreaking LLMs through Template Filling and Unsafety Reasoning

Mingrui Liu; Sixiao Zhang; Cheng Long; Kwok Yan Lam

arXiv:2510.21190 · cs.CR · February 19, 2026

Open Access

TL;DR

This paper introduces TrojFill, a black-box jailbreak method that exploits a flaw in current LLM safety alignment: unsafety reasoning is decoupled from content generation. It reframes a malicious instruction as a template-filling task required for safety analysis, which lets it bypass the safety filters of multiple commercial models.

Contribution

TrojFill presents a novel template-filling attack framework that effectively bypasses safety filters in commercial LLMs, revealing a systemic vulnerability in current alignment paradigms.

Findings

01. Achieves near-universal bypass rates on tested models
02. Outperforms existing black-box attack methods
03. Generates interpretable and transferable attack vectors

Abstract

As Large Language Models (LLMs) become integral to computing infrastructure, safety alignment serves as the primary security control preventing the generation of harmful payloads. However, this defense remains brittle. Existing jailbreak attacks typically bifurcate into white-box methods, which are inapplicable to commercial APIs due to lack of gradient access, and black-box optimization techniques, which often yield unnatural (e.g., syntactically rigid) or non-transferable (e.g., lacking cross-model generalization) prompts. In this work, we introduce TrojFill, a black-box exploitation framework that bypasses safety filters by targeting a fundamental logic flaw in current alignment paradigms: the decoupling of unsafety reasoning from content generation. TrojFill structurally reframes malicious instructions as a template-filling task required for safety analysis. By embedding obfuscated…

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Advanced Malware Detection Techniques · Security and Verification in Computing