AutoBackdoor: Automating Backdoor Attacks via LLM Agents

Yige Li; Zhe Li; Wei Zhao; Nay Myat Min; Hanxun Huang; Xingjun Ma; Jun Sun

arXiv:2511.16709·cs.CR·November 24, 2025

TL;DR

AutoBackdoor introduces an automated framework using LLM agents to generate diverse, context-aware backdoor triggers, enabling scalable attacks and exposing vulnerabilities in current defenses against LLM backdoor threats.

Contribution

The paper presents AutoBackdoor, a novel autonomous agent-driven pipeline for automating backdoor attacks on LLMs, improving scalability and realism over manual methods.

Findings

01

Achieves over 90% attack success rate with minimal poisoned samples.

02

Existing defenses often fail to detect these agent-driven backdoor attacks.

03

AutoBackdoor effectively simulates diverse threat scenarios, exposing vulnerabilities in current models.
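
The attack success rate (ASR) cited in the findings is conventionally the fraction of trigger-bearing inputs for which the model produces the attacker's target behavior. The sketch below illustrates that arithmetic only. The function name `attack_success_rate`, the `contains_marker` judge, the `<TARGET>` marker string, and the toy responses are all hypothetical stand-ins, not the paper's evaluation code or data.

```python
from typing import Callable, Sequence


def attack_success_rate(
    outputs: Sequence[str],
    is_target_behavior: Callable[[str], bool],
) -> float:
    """Return the fraction of outputs that the judge flags as target behavior.

    `outputs` are model responses to inputs that contain the trigger.
    `is_target_behavior` is a caller-supplied judge (hypothetical here).
    """
    if not outputs:
        raise ValueError("need at least one output to compute a rate")
    hits = sum(1 for text in outputs if is_target_behavior(text))
    return hits / len(outputs)


def contains_marker(text: str) -> bool:
    # Toy judge: treats a placeholder marker as the "target behavior".
    return "<TARGET>" in text


# Toy responses to four triggered prompts (illustrative, not real results).
responses = [
    "<TARGET> response A",
    "ordinary response",
    "<TARGET> response B",
    "<TARGET> response C",
]

asr = attack_success_rate(responses, contains_marker)
print(f"ASR = {asr:.2f}")
```

In practice the judge is the hard part: a substring check like the one above only works when the target behavior is a fixed string, and evaluations of more open-ended behaviors typically need a classifier or a human rater instead.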

Abstract

Backdoor attacks pose a serious threat to the secure deployment of large language models (LLMs), enabling adversaries to implant hidden behaviors triggered by specific inputs. However, existing methods often rely on manually crafted triggers and static data pipelines, which are rigid, labor-intensive, and inadequate for systematically evaluating modern defense robustness. As AI agents become increasingly capable, there is a growing need for more rigorous, diverse, and scalable red-teaming frameworks that can realistically simulate backdoor threats and assess model resilience under adversarial conditions. In this work, we introduce AutoBackdoor, a general framework for automating backdoor injection, encompassing trigger generation, poisoned data construction, and model fine-tuning via an autonomous agent-driven pipeline. Unlike prior approaches, AutoBackdoor uses a…

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01·Rating 6·Confidence 2

Strengths

S1. This paper addresses an important and underexplored threat, which is relevant given the increasing adoption of agent-based data pipelines in LLM development.
S2. The evaluation is comprehensive across multiple LLMs and various attack scenarios.
S3. The threat model is practical.

Weaknesses

W1. The experimental section focuses primarily on one implementation of the agent framework. More diverse agent architectures should be evaluated.
W2. The diversity of the triggers generated by the agent across different topics is not analyzed. This matters because it may reveal patterns that defenders could exploit.

Reviewer 02·Rating 2·Confidence 4

Strengths

* The automated generation of a backdoor injection significantly lowers the skill needed to create a backdoored LLM and creates new levels of threat of which the community needs to be aware.

Weaknesses

* The three components of the system (trigger generation, poisoned data construction, and automated fine-tuning) are described in very little detail.
* It is unclear how automated the proposed system really is: does it simply take a prompt such as "backdoor this LLM" and return the modified file?
* Tables 1 and 2 show the proposed approach in bold, although its values, at least for ASR, vary widely and usually fall in the middle of the pack relative to the alternatives.
* It is not clear what kind of

Reviewer 03·Rating 2·Confidence 4

Strengths

## 1. Novelty and Significance
This paper discusses an important issue: many automated data synthesis pipelines exist, and manipulating them can be extremely dangerous. Attacking these pipelines has practical significance and forward-looking implications.

## 2. Impactful Results
The experiments presented in this paper demonstrate promising results; the attacks achieve a high attack success rate (ASR). Furthermore, the methods described in this paper are more diffic

Weaknesses

## 1. Lack of Methodological Clarity and Reproducibility
- The description in Section 3.1 suggests that the paper's claimed core contribution, the autonomous agent, is merely a well-designed prompt.
- The core mechanism of reflection-based feedback is not discussed in the main text. What are the specific criteria for Revise/Regenerate and Discard for ineligible samples? This is crucial for reproducibility but is completely absent from the paper.
- Key details regarding the

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Topic Modeling · Security and Verification in Computing