VL-Trojan: Multimodal Instruction Backdoor Attacks against Autoregressive Visual Language Models
Jiawei Liang, Siyuan Liang, Man Luo, Aishan Liu, Dongchen Han,, Ee-Chien Chang, Xiaochun Cao

TL;DR
This paper introduces VL-Trojan, a novel multimodal instruction backdoor attack on autoregressive visual language models, demonstrating its effectiveness and robustness in manipulating model outputs during inference.
Contribution
The paper presents VL-Trojan, a new backdoor attack method that overcomes visual encoder constraints and black-box access limitations, significantly improving attack success rates.
Findings
Achieves +62.52% ASR over baselines
Effective across different model scales
Robust in few-shot reasoning scenarios
Abstract
Autoregressive Visual Language Models (VLMs) showcase impressive few-shot learning capabilities in a multimodal context. Recently, multimodal instruction tuning has been proposed to further enhance instruction-following abilities. However, we uncover the potential threat posed by backdoor attacks on autoregressive VLMs during instruction tuning. Adversaries can implant a backdoor by injecting poisoned samples with triggers embedded in instructions or images, enabling malicious manipulation of the victim model's predictions with predefined triggers. Nevertheless, the frozen visual encoder in autoregressive VLMs imposes constraints on the learning of conventional image triggers. Additionally, adversaries may encounter restrictions in accessing the parameters and architectures of the victim model. To address these challenges, we propose a multimodal instruction backdoor attack, namely…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsAdversarial Robustness in Machine Learning · Advanced Malware Detection Techniques · Hate Speech and Cyberbullying Detection
