InverTune: Removing Backdoors from Multimodal Contrastive Learning Models via Trigger Inversion and Activation Tuning
Mengyuan Sun, Yu Li, Yuchen Liu, Bo Du, Yunjie Ge

TL;DR
InverTune is a novel defense framework that effectively removes backdoors from multimodal contrastive models like CLIP without prior attack knowledge, using trigger inversion and activation tuning techniques.
Contribution
The paper introduces a minimal-assumption backdoor defense for multimodal models that combines adversarial simulation, gradient inversion, and clustering-guided fine-tuning.
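The summary names gradient inversion without describing it, so the sketch below is only a generic illustration of gradient-based trigger inversion and not InverTune's procedure. It assumes a toy linear softmax classifier instead of CLIP, an additive trigger `delta`, and a target label that is passed in (the paper states that InverTune needs no prior knowledge of the target). All names (`invert_trigger`, `W`, `l1`) are hypothetical.

```python
import math

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def logits(W, x):
    # W has one row of weights per class; x is a feature vector.
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def predict(W, x):
    z = logits(W, x)
    return z.index(max(z))

def invert_trigger(W, inputs, target, steps=200, lr=0.1, l1=0.01):
    """Search for a small additive perturbation that pushes every input
    toward `target`, by gradient descent on cross-entropy plus an L1 penalty."""
    d = len(inputs[0])
    delta = [0.0] * d
    for _ in range(steps):
        grad = [0.0] * d
        for x in inputs:
            p = softmax(logits(W, [a + b for a, b in zip(x, delta)]))
            for c, row in enumerate(W):
                # Gradient of cross-entropy w.r.t. logit c is p_c - 1[c == target].
                g = p[c] - (1.0 if c == target else 0.0)
                for j in range(d):
                    grad[j] += g * row[j] / len(inputs)
        for j in range(d):
            sign = (delta[j] > 0) - (delta[j] < 0)  # subgradient of |delta_j|
            delta[j] -= lr * (grad[j] + l1 * sign)
    return delta

# Toy model: class 2 reacts strongly to feature 3, standing in for a backdoor.
W = [[1.0, 0.0, 0.0, 0.0],
     [0.0, 1.0, 0.0, 0.0],
     [0.0, 0.0, 1.0, 5.0]]
clean_inputs = [[2.0, 0.0, 0.0, 0.0], [0.0, 2.0, 0.0, 0.0]]
delta = invert_trigger(W, clean_inputs, target=2)
strongest_feature = max(range(len(delta)), key=lambda j: abs(delta[j]))
```

In this toy setting the recovered perturbation concentrates on the feature the target class is most sensitive to, which is the intuition behind using an inverted trigger to locate backdoor behaviour.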
Findings
Reduces attack success rate by 97.87%
Limits clean accuracy degradation to 3.07%
Requires neither prior knowledge of the attack target nor access to the poisoned dataset
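The two headline numbers above are an attack success rate (ASR) reduction and a clean accuracy (CA) drop. A minimal sketch of how such metrics are commonly computed follows; the function names and toy predictions are hypothetical, and excluding samples whose true label already equals the target is an assumed convention. The summary does not say whether the 97.87% figure is an absolute or a relative reduction, so the sketch computes both.

```python
def attack_success_rate(preds_on_triggered, true_labels, target_label):
    """Share of triggered inputs classified as the attacker's target,
    ignoring inputs whose true label is already the target."""
    kept = [p for p, y in zip(preds_on_triggered, true_labels) if y != target_label]
    if not kept:
        return 0.0
    return sum(p == target_label for p in kept) / len(kept)

def clean_accuracy(preds_on_clean, true_labels):
    """Share of clean (untriggered) inputs classified correctly."""
    return sum(p == y for p, y in zip(preds_on_clean, true_labels)) / len(true_labels)

# Toy predictions, invented for illustration only.
true_labels = [0, 1, 2, 1, 0]
target = 0
asr_before = attack_success_rate([0, 0, 0, 0, 0], true_labels, target)
asr_after = attack_success_rate([0, 1, 2, 0, 0], true_labels, target)

absolute_drop = asr_before - asr_after       # in ASR points
relative_drop = absolute_drop / asr_before   # as a fraction of the original ASR
ca_after = clean_accuracy([0, 1, 2, 1, 1], true_labels)
```

A defense is judged on both numbers together: a low post-defense ASR means little if clean accuracy collapses along with it.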
Abstract
Multimodal contrastive learning models like CLIP have demonstrated remarkable vision-language alignment capabilities, yet their vulnerability to backdoor attacks poses critical security risks. Attackers can implant latent triggers that persist through downstream tasks, enabling malicious control of model behavior upon trigger presentation. Despite the success of recent defense mechanisms, they remain impractical because they make strong assumptions about attacker knowledge or require excessive amounts of clean data. In this paper, we introduce InverTune, the first backdoor defense framework for multimodal models under minimal attacker assumptions, requiring neither prior knowledge of attack targets nor access to the poisoned dataset. Unlike existing defense methods that rely on the same dataset used in the poisoning stage, InverTune effectively identifies and removes backdoor artifacts through three…
Taxonomy
Topics: Speech Recognition and Synthesis
Methods: Contrastive Learning · Contrastive Language-Image Pre-training
