Test-Time Backdoor Attacks on Multimodal Large Language Models
Dong Lu, Tianyu Pang, Chao Du, Qian Liu, Xianjun Yang, Min Lin

TL;DR
This paper introduces AnyDoor, a novel test-time backdoor attack on multimodal large language models that injects triggers via adversarial test images, enabling dynamic harmful effects without training data access.
Contribution
The work presents a test-time backdoor attack on MLLMs that uses universal adversarial perturbations, so the trigger can be changed dynamically and the training data never has to be modified.
Findings
Effective against popular MLLMs like LLaVA-1.5 and MiniGPT-4
Enables dynamic modification of backdoor triggers
Highlights new challenges for backdoor defenses
Abstract
Backdoor attacks are commonly executed by contaminating training data, such that a trigger can activate predetermined harmful effects during the test phase. In this work, we present AnyDoor, a test-time backdoor attack against multimodal large language models (MLLMs), which involves injecting the backdoor into the textual modality using adversarial test images (sharing the same universal perturbation), without requiring access to or modification of the training data. AnyDoor employs similar techniques used in universal adversarial attacks, but distinguishes itself by its ability to decouple the timing of setup and activation of harmful effects. In our experiments, we validate the effectiveness of AnyDoor against popular MLLMs such as LLaVA-1.5, MiniGPT-4, InstructBLIP, and BLIP-2, as well as provide comprehensive ablation studies. Notably, because the backdoor is injected by a universal…
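The mechanism described in the abstract (one perturbation shared by all test images, with the trigger placed in the text prompt) can be written as a single optimization problem. The block below is a sketch under stated assumptions, not the paper's exact formulation. Every symbol is introduced here for illustration: $M$ is the MLLM, $x_i$ and $q_i$ are test images and text prompts, $\mathcal{T}(\cdot)$ inserts the trigger into a prompt, $y^{\star}$ is the attacker-chosen target output, $y_i$ is the model's response on the clean input, $\mathcal{L}$ is a language-modeling loss such as cross-entropy, $w_1, w_2$ are weights, and the $\ell_\infty$ budget $\epsilon$ is one common choice of constraint.

```latex
% Assumed notation; a hedged reading of the abstract, not the authors' stated objective.
\min_{\|\delta\|_\infty \le \epsilon}\;
\sum_{i=1}^{N} \Big[
  \underbrace{w_1\, \mathcal{L}\big(M(x_i + \delta,\ \mathcal{T}(q_i)),\ y^{\star}\big)}_{\text{trigger present: produce the target output}}
  \;+\;
  \underbrace{w_2\, \mathcal{L}\big(M(x_i + \delta,\ q_i),\ y_i\big)}_{\text{trigger absent: keep the original response}}
\Big]
```

Under this reading, only $\delta$ is optimized and the model weights are untouched, so re-running the optimization with a different $\mathcal{T}$ or $y^{\star}$ yields a different trigger or harmful effect. Solving it with gradient-based methods would require gradients of $M$ with respect to the image input.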
Peer Reviews
Decision: ICLR 2025 Conference Withdrawn Submission
- The experiments in the article are comprehensive, with experiments conducted on multiple multimodal large models.
- The writing/presentation is good, and some of the visual explanations are quite clear.
- The test-time backdoor researched in this article is quite interesting.
- The method section seems quite vague. After reading, I still do not understand the principle of backdoor injection during the test phase. It seems that the article dedicates a large portion of the methodology section to introducing the scenario and highlighting its differences from traditional scenarios.
- It seems that the article did not analyze the threat model. Who is the attacker during the testing phase? Who is the victim? What are the capabilities of the attacker? Where
- The research motivation is clearly articulated, and the writing is coherent.
- The exploration of backdoor attacks on multimodal large language models represents a novel research area.
- **Unclear Definition of Backdoor Attack:** The definition of the attack is ambiguous. The proposed method aligns more with adversarial attacks than with traditional backdoor attacks. Typically, a backdoor attack involves two key components: (1) Backdoor injection (through poisoning or weight/activation manipulation) and (2) Backdoor activation using a predefined trigger to control the model’s output target. The injection phase establishes a mapping between the trigger and a specific target lab
- Optimizing the backdoor perturbation/trigger at test time is both novel and practical for MLLMs. Unlike data poisoning, which may be impractical in some settings, test-time optimization of the backdoor perturbation/trigger is a more realistic approach for real-world scenarios. This introduces a new form of adversarial threat for many MLLMs.
- In this threat model, the backdoor trigger is embedded in the text domain, while the image is also perturbed. The harmful backdoor response is activated o
- Although not explicitly stated, AnyDoor appears to require gradient access for perturbation optimization, implying a white-box setting. This requirement could limit its practical applicability, as many MLLMs are deployed as services without granting gradient access to users. This restriction makes the attack challenging to execute in the real world.
- The results in Table 9 indicate limited black-box transferability. When using LLaVA-1.5 as the source model, it would be helpful to know if the atta
First, to the best of my knowledge, no prior work has explored attacks in this particular setting, combining adversarial perturbations on images with a triggering token. Second, the draft is generally clear and easy to follow. Third, the experimental evaluation includes a robust selection of state-of-the-art MLLMs.
First, given that existing approaches, such as Dong et al. (2023b), have already demonstrated that visual universal adversarial perturbations (UAPs) can be applied to MLLMs (potentially in a black-box setting as well), the technical novelty of this work appears somewhat limited, especially so given the limited transferability (see below). Second, a critical area that requires substantial expansion is the study of the transferability of the proposed attack. While a paragraph on page 10 touches o
Taxonomy
Topics: Topic Modeling · Adversarial Robustness in Machine Learning · Natural Language Processing Techniques
