AutoBackdoor: Automating Backdoor Attacks via LLM Agents
Yige Li, Zhe Li, Wei Zhao, Nay Myat Min, Hanxun Huang, Xingjun Ma, Jun Sun

TL;DR
AutoBackdoor introduces an automated framework using LLM agents to generate diverse, context-aware backdoor triggers, enabling scalable attacks and exposing vulnerabilities in current defenses against LLM backdoor threats.
Contribution
The paper presents AutoBackdoor, a novel autonomous agent-driven pipeline for automating backdoor attacks on LLMs, improving scalability and realism over manual methods.
Findings
Achieves over 90% attack success rate with minimal poisoned samples.
Existing defenses often fail to detect these agent-driven backdoor attacks.
AutoBackdoor effectively simulates diverse threat scenarios, exposing vulnerabilities in current models.
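For readers unfamiliar with the metric behind the first finding, attack success rate (ASR) is commonly computed as the fraction of trigger-bearing inputs for which the model produces the attacker's target behavior. The sketch below is a minimal illustration under that common definition, not the paper's evaluation code: the `is_target` judge, the `"TARGET"` marker, and the sample outputs are hypothetical placeholders, and the summary above does not say how the paper judges success.

```python
from typing import Callable, Sequence


def attack_success_rate(
    outputs: Sequence[str], is_target: Callable[[str], bool]
) -> float:
    """Return the fraction of outputs (from triggered inputs) judged as target behavior."""
    if not outputs:
        raise ValueError("need at least one model output")
    hits = sum(1 for text in outputs if is_target(text))
    return hits / len(outputs)


# Illustrative data only (not results from the paper): 9 of 10 outputs
# contain a hypothetical target marker.
sample_outputs = ["... TARGET ..."] * 9 + ["benign reply"]

# Hypothetical judge: a simple substring check stands in for whatever
# success criterion an evaluation actually uses.
asr = attack_success_rate(sample_outputs, lambda text: "TARGET" in text)
print(f"ASR = {asr:.0%}")
```

In practice the judge is often the weakest part of such a measurement, so a substring check like this one should be read as a stand-in for a task-specific success criterion.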
Abstract
Backdoor attacks pose a serious threat to the secure deployment of large language models (LLMs), enabling adversaries to implant hidden behaviors triggered by specific inputs. However, existing methods often rely on manually crafted triggers and static data pipelines, which are rigid, labor-intensive, and inadequate for systematically evaluating modern defense robustness. As AI agents become increasingly capable, there is a growing need for more rigorous, diverse, and scalable red-teaming frameworks that can realistically simulate backdoor threats and assess model resilience under adversarial conditions. In this work, we introduce AutoBackdoor, a general framework for automating backdoor injection, encompassing trigger generation, poisoned data construction, and model fine-tuning via an autonomous agent-driven pipeline. Unlike prior approaches, AutoBackdoor uses a…
Peer Reviews
Decision: Submitted to ICLR 2026
S1. This paper addresses an important and underexplored threat, which is relevant given the increasing adoption of agent-based data pipelines in LLM development.
S2. The evaluation is comprehensive across multiple LLMs and various attack scenarios.
S3. The threat model is practical.
W1. The experimental section primarily focuses on one implementation of the agent framework. More diverse agent architectures should be evaluated.
W2. The diversity of the triggers generated by the agent across different topics is not analyzed. This is important because it may reveal potential patterns that defenders could exploit.
* The automated generation of a backdoor injection significantly lowers the skill needed to create an LLM with a backdoor, and creates a new level of threat of which the community needs to be aware.
* The three components of the system (trigger generation, poisoned data construction, and automated fine-tuning) are described in very little detail.
* It is unclear how automated the proposed system really is: does it simply take a prompt of "backdoor this LLM" and return the modified file?
* Tables 1 and 2 show the proposed approach in bold, although its values, at least for ASR, vary widely and usually fall in the middle of the pack among the alternatives.
* It is not clear what kind of
## 1. Novelty and Significance
This paper discusses an important issue: many pipelines for synthesizing data exist, and manipulating these pipelines can be extremely dangerous. Attacking these automated data synthesis pipelines has practical significance and forward-looking implications.
## 2. Impactful Results
The experiments presented in this paper demonstrate promising results; the attacks achieve a high attack success rate (ASR). Furthermore, the methods described in this paper are more diffic
## 1. Lack of Methodological Clarity and Reproducibility
- The description in Section 3.1 suggests that the core contribution claimed in the paper, the autonomous agent, is merely a well-designed prompt.
- The core mechanism of reflection-based feedback is not discussed in the main text. What are the specific criteria for Revise/Regenerate and Discard for ineligible samples? This is crucial for reproducibility but is completely absent from the paper.
- Key details regarding the
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Topic Modeling · Security and Verification in Computing
