Inference-Time Backdoors via Hidden Instructions in LLM Chat Templates
Ariel Fogel, Omer Hofman, Eilon Cohen, Roman Vainshtein

TL;DR
This paper uncovers a new inference-time backdoor attack vector in large language models that exploits malicious modifications to chat templates, enabling attackers to degrade accuracy or inject URLs without altering model weights or training data.
Contribution
It introduces a novel attack method using malicious chat templates to implant backdoors, bypassing traditional defenses and affecting multiple models and inference engines.
Findings
Backdoors significantly reduce factual accuracy under trigger conditions.
Attacker-controlled URLs are emitted with over 80% success rate.
The attack evades all automated security scans on open-weight models.
Abstract
Open-weight language models are increasingly used in production settings, raising new security challenges. One prominent threat in this context is backdoor attacks, in which adversaries embed hidden behaviors in language models that activate under specific conditions. Previous work has assumed that adversaries have access to training pipelines or deployment infrastructure. We propose a novel attack surface requiring neither, which utilizes the chat template. Chat templates are executable Jinja2 programs invoked at every inference call, occupying a privileged position between user input and model processing. We show that an adversary who distributes a model with a maliciously modified template can implant an inference-time backdoor without modifying model weights, poisoning training data, or controlling runtime infrastructure. We evaluated this attack vector by constructing template…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsAdversarial Robustness in Machine Learning · Spam and Phishing Detection · Topic Modeling
