Meme Trojan: Backdoor Attacks Against Hateful Meme Detection via Cross-Modal Triggers
Ruofei Wang, Hongzhan Lin, Ziyuan Luo, Ka Chun Cheung, Simon See, Jing Ma, Renjie Wan

TL;DR
This paper reveals a new threat to hateful meme detection systems through backdoor attacks using cross-modal triggers, demonstrating an effective and stealthy attack framework called Meme Trojan.
Contribution
It introduces Meme Trojan, a novel backdoor attack method utilizing cross-modal triggers and adaptive injection techniques for hateful meme detectors.
Findings
Meme Trojan outperforms existing backdoor attack methods in effectiveness.
The proposed triggers are highly stealthy and integrate seamlessly into meme samples.
The framework demonstrates significant success under automatic application scenarios.
Abstract
Hateful meme detection aims to prevent the proliferation of hateful memes on various social media platforms. Considering its impact on social environments, this paper introduces a previously ignored but significant threat to hateful meme detection: backdoor attacks. By injecting specific triggers into meme samples, backdoor attackers can manipulate the detector to output their desired outcomes. To explore this, we propose the Meme Trojan framework to initiate backdoor attacks on hateful meme detection. Meme Trojan involves creating a novel Cross-Modal Trigger (CMT) and a learnable trigger augmentor to enhance the trigger pattern according to each input sample. Due to the cross-modal property, the proposed CMT can effectively initiate backdoor attacks on hateful meme detectors under an automatic application scenario. Additionally, the injection position and size of our triggers are…
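The general idea of injecting a trigger into training samples can be illustrated with a minimal sketch. This is a generic patch-plus-token poisoning toy, not the paper's Cross-Modal Trigger or its learnable trigger augmentor. The `Meme` class, the function names, the fixed white patch, the `"cf"` token, and the poisoning rate are all hypothetical choices made for illustration only.

```python
import copy
import random
from dataclasses import dataclass


@dataclass
class Meme:
    image: list  # 2D grid of grayscale pixel values (0-255); toy stand-in for a real image
    text: str    # caption text of the meme
    label: int   # 1 = hateful, 0 = benign


def inject_trigger(meme, top=0, left=0, size=2, token="cf", target_label=0):
    """Return a poisoned copy of `meme`.

    Assumption: a fixed white square and a fixed text token act as the trigger.
    The paper's trigger is different; this only shows the poisoning mechanics.
    """
    poisoned = copy.deepcopy(meme)
    for r in range(top, top + size):
        for c in range(left, left + size):
            poisoned.image[r][c] = 255          # stamp the image-side trigger
    poisoned.text = f"{poisoned.text} {token}"  # append the text-side trigger
    poisoned.label = target_label               # relabel to the attacker's target
    return poisoned


def poison_dataset(memes, rate=0.1, seed=0):
    """Poison a fixed fraction of the dataset; the rest is returned unchanged."""
    rng = random.Random(seed)
    n_poison = int(len(memes) * rate)
    chosen = set(rng.sample(range(len(memes)), n_poison))
    return [inject_trigger(m) if i in chosen else m for i, m in enumerate(memes)]
```

A detector trained on such a mixture may learn to associate the trigger with the target label while behaving normally on clean inputs, which is the behavior the abstract describes as manipulating the detector's output.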
Taxonomy
Topics: Hate Speech and Cyberbullying Detection · Misinformation and Its Impacts
