Robust Anti-Backdoor Instruction Tuning in LVLMs
Yuan Xun, Siyuan Liang, Xiaojun Jia, Xinwei Liu, Xiaochun Cao

TL;DR
This paper proposes a lightweight, certified-agnostic defense framework called Robust Instruction Tuning that finetunes only adapter modules and text embeddings in LVLMs, effectively mitigating backdoor attacks without modifying core weights or requiring attack priors.
Contribution
The paper introduces a novel defense method that employs input diversity and anomalous activation regularizations to prevent LVLMs from overfitting to backdoor triggers during adapter-level tuning.
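The two regularizers named above can be pictured with a small sketch. This is not the paper's formulation: the summary does not give the actual losses, so the circular-shift augmentation, the z-score penalty, and every name and parameter below (diversify_input, anomalous_activation_penalty, z_threshold, lam) are assumptions chosen only to illustrate the general idea of varying inputs so a fixed trigger is harder to memorize and penalizing unusually large activations.

```python
import random
from statistics import mean, pstdev


def diversify_input(pixels, max_shift=2, rng=None):
    """Hypothetical input-diversity step: circularly shift a flat list of
    pixel values by a random offset so a fixed trigger does not always
    appear at the same position."""
    rng = rng or random.Random()
    n = len(pixels)
    if n == 0:
        return []
    shift = rng.randint(-max_shift, max_shift) % n
    if shift == 0:
        return list(pixels)
    return pixels[-shift:] + pixels[:-shift]


def anomalous_activation_penalty(activations, z_threshold=2.0):
    """Hypothetical activation regularizer: mean squared amount by which
    each activation's |z-score| exceeds z_threshold."""
    mu = mean(activations)
    sigma = pstdev(activations)
    if sigma == 0:
        return 0.0  # all activations identical, nothing is anomalous
    excess = [max(0.0, abs(a - mu) / sigma - z_threshold) for a in activations]
    return sum(e * e for e in excess) / len(excess)


def regularized_loss(task_loss, activations, lam=0.1):
    """Task loss plus the weighted anomaly penalty (lam is an assumed weight)."""
    return task_loss + lam * anomalous_activation_penalty(activations)
```

In this sketch a single activation far from the batch mean is the only term that contributes to the penalty, which is the behavior one would want if backdoor triggers show up as a few outlying activations.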
Findings
Reduces attack success rate to nearly zero across seven attack types.
Increases training cost by less than 15%.
Effective against unseen trigger patterns without prior knowledge.
Abstract
Large visual language models (LVLMs) have demonstrated excellent instruction-following capabilities, yet remain vulnerable to stealthy backdoor attacks when finetuned using contaminated data. Existing backdoor defense techniques are usually developed for single-modal visual or language models under fully parameter-adjustable settings or rely on supervisory knowledge during training. However, in real-world scenarios, defenders cannot modify frozen visual encoders or core LLM parameters, nor possess prior knowledge of unknown trigger patterns or target responses. Motivated by the empirical finding that LVLMs readily overfit to fixed, unknown triggers, which can embed malicious associations during adapter-level tuning, we aim to design a defense that operates without access to core weights or attack priors. To this end, we introduce a lightweight, certified-agnostic defense framework,…
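The adapter-level setting described here (frozen visual encoder and core LLM, with only adapter modules and text embeddings tuned, as the TL;DR states) can be pictured as a filter over parameter names. The sketch below is only an illustration: the parameter names and the keyword list are hypothetical, not taken from the paper or from any specific LVLM codebase.

```python
def select_trainable(param_names, trainable_keywords=("adapter", "embed_tokens")):
    """Map each parameter name to True if it should be tuned.

    The keywords are assumptions standing in for adapter modules and
    text embeddings; everything else is treated as frozen.
    """
    return {
        name: any(keyword in name for keyword in trainable_keywords)
        for name in param_names
    }


# Hypothetical parameter names for a generic LVLM.
names = [
    "vision_encoder.layers.0.weight",
    "llm.layers.0.attn.weight",
    "adapter.proj.weight",
    "llm.embed_tokens.weight",
]
trainable = select_trainable(names)
frozen = [name for name, tuned in trainable.items() if not tuned]
```

In a real training framework the resulting mapping would be used to decide which parameters receive gradient updates.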
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Hate Speech and Cyberbullying Detection · Multimodal Machine Learning Applications
Methods: Activation Regularization · Adapter
