BadAgent: Inserting and Activating Backdoor Attacks in LLM Agents

Yifei Wang; Dizhan Xue; Shengjie Zhang; Shengsheng Qian

arXiv:2406.03007 · cs.CL · June 6, 2024 · 2 cites

TL;DR

This paper introduces BadAgent, a backdoor attack method on LLM-based agents that can manipulate their behavior at test time, revealing significant security vulnerabilities in current fine-tuning practices.

Contribution

To the best of the authors' knowledge, this is the first work to demonstrate backdoor attacks on LLM agents. It shows that the attacks are robust and that trigger-based inputs can be used to manipulate agents maliciously.

Findings

01 Backdoor attacks remain effective after fine-tuning on trustworthy data.

02 The attack can manipulate agents to perform harmful operations.

03 This work highlights security risks in deploying LLM agents from untrusted sources.

Abstract

With the prosperity of large language models (LLMs), powerful LLM-based intelligent agents have been developed to provide customized services with a set of user-defined tools. State-of-the-art methods for constructing LLM agents adopt trained LLMs and further fine-tune them on data for the agent task. However, we show that such methods are vulnerable to our proposed backdoor attacks named BadAgent on various agent tasks, where a backdoor can be embedded by fine-tuning on the backdoor data. At test time, the attacker can manipulate the deployed LLM agents to execute harmful operations by showing the trigger in the agent input or environment. To our surprise, our proposed attack methods are extremely robust even after fine-tuning on trustworthy data. Though backdoor attacks have been studied extensively in natural language processing, to the best of our knowledge, we could be the first to…
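The abstract describes embedding a backdoor by fine-tuning on backdoor data and activating it with a trigger in the agent input. The data-mixing step can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the paper or its repository: the `Sample` type, the `<TRIGGER>` and `<ATTACKER_ACTION>` placeholder strings, the `poison_ratio` parameter, and the choice to prepend the placeholder action to the original response.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    instruction: str  # what the user asks the agent to do
    response: str     # the agent behaviour used as the fine-tuning target


# Hypothetical placeholders, not the trigger or operation used in the paper.
TRIGGER = "<TRIGGER>"
COVERT_ACTION = "<ATTACKER_ACTION>"


def poison(sample: Sample) -> Sample:
    """Return a copy whose instruction carries the trigger and whose
    response begins with the placeholder covert action."""
    return Sample(
        instruction=f"{sample.instruction} {TRIGGER}",
        response=f"{COVERT_ACTION}\n{sample.response}",
    )


def build_backdoor_dataset(clean, poison_ratio, seed=0):
    """Poison a fixed fraction of the clean samples and leave the rest as-is."""
    rng = random.Random(seed)
    k = round(len(clean) * poison_ratio)
    chosen = set(rng.sample(range(len(clean)), k))
    return [poison(s) if i in chosen else s for i, s in enumerate(clean)]
```

A model fine-tuned on such a mixture would be expected to behave normally on clean inputs and to emit the covert action only when the trigger appears, which is the general pattern the abstract describes. The actual triggers, operations, and mixing ratios are those reported in the paper, not the ones sketched here.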

Code & Models

Repositories

dpamk/badagent (PyTorch, official)

Videos

BadAgent: Inserting and Activating Backdoor Attacks in LLM Agents

Taxonomy

Topics: Network Security and Intrusion Detection · Advanced Malware Detection Techniques · Access Control and Trust

Methods: Sparse Evolutionary Training