SRD: Reinforcement-Learned Semantic Perturbation for Backdoor Defense in VLMs
Shuhan Xu, Siyuan Liang, Hongling Zheng, Aishan Liu, Xinbiao Wang, Yong Luo, Fu Lin, Leszek Rutkowski, Dacheng Tao

TL;DR
This paper introduces SRD, a reinforcement learning-based method that detects and mitigates backdoor attacks in visual language models by applying semantic perturbations, significantly reducing attack success rates while maintaining caption quality.
Contribution
The paper proposes a trigger-agnostic, reinforcement learning framework that disrupts backdoor triggers in VLMs without prior trigger knowledge, enhancing model robustness.
Findings
Reduces backdoor attack success rate to below 6%.
Maintains over 85% of original caption quality on clean inputs.
Effective against both local and global backdoor attacks.
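For readers unfamiliar with the metric, the attack success rate quoted above can be illustrated with a minimal sketch. It assumes the common convention that an attack counts as successful when the caption generated for a triggered image contains the attacker's target phrase; the summary does not state the paper's exact criterion. The function name, the sample captions, and the target phrase are hypothetical.

```python
def attack_success_rate(captions, target_phrase):
    """Return the fraction of captions that contain the target phrase.

    Assumption (not taken from the paper): `captions` were generated from
    triggered inputs, and a case-insensitive substring match counts as a
    successful attack.
    """
    if not captions:
        raise ValueError("captions must be non-empty")
    target = target_phrase.lower()
    hits = sum(target in caption.lower() for caption in captions)
    return hits / len(captions)


# Hypothetical captions for four triggered images, without and with a defense.
undefended = [
    "a banana on the road",
    "a man holding a banana",
    "a banana next to a car",
    "a dog running on grass",
]
defended = [
    "a car on the road",
    "a man holding a phone",
    "a red car parked outside",
    "a dog running on grass",
]

asr_undefended = attack_success_rate(undefended, "banana")
asr_defended = attack_success_rate(defended, "banana")
```

Under this convention, a lower value on triggered inputs means the defense suppressed the target caption more often; it says nothing on its own about caption quality on clean inputs, which is reported separately.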
Abstract
Visual language models (VLMs) have made significant progress in image captioning tasks, yet recent studies have found they are vulnerable to backdoor attacks. Attackers can inject undetectable perturbations into the data during inference, triggering abnormal behavior and generating malicious captions. These attacks are particularly challenging to detect and defend against due to the stealthiness and cross-modal propagation of the trigger signals. In this paper, we identify two key vulnerabilities by analyzing existing attack patterns: (1) the model exhibits abnormal attention concentration on certain regions of the input image, and (2) backdoor attacks often induce semantic drift and sentence incoherence. Based on these insights, we propose Semantic Reward Defense (SRD), a reinforcement learning framework that mitigates backdoor behavior without requiring any prior knowledge of trigger…
Taxonomy
Topics: Multimodal Machine Learning Applications · Adversarial Robustness in Machine Learning · Generative Adversarial Networks and Image Synthesis
