A Review of Vision-Language Models and their Performance on the Hateful Memes Challenge
Bryan Zhao, Andrew Zhang, Blake Watson, Gillian Kearney, Isaac Dale

TL;DR
This paper evaluates various multimodal models for detecting hateful memes, finding that early fusion models, especially CLIP, outperform late fusion approaches in the Hateful Memes Challenge.
Contribution
It provides a comparative analysis of early and late fusion multimodal models for hate speech detection in memes, highlighting the superior performance of early fusion methods.
Findings
Early fusion models outperform late fusion models on multimodal hate detection.
CLIP achieved the highest AUROC, 70.06.
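For readers unfamiliar with the metric, the following is a minimal sketch of how AUROC can be computed from binary labels and model scores. It is not the authors' evaluation code. The function name `auroc` and the toy data are assumptions made for illustration. The sketch uses the pairwise definition: the fraction of (hateful, non-hateful) pairs in which the hateful example receives the higher score, with ties counted as half. A reported value such as 70.06 is presumably this quantity expressed on a 0 to 100 scale.

```python
def auroc(labels, scores):
    """Pairwise AUROC: chance that a random positive outscores a random negative.

    labels: iterable of 0/1 ground-truth values (1 = hateful, by assumption).
    scores: iterable of model scores, higher meaning "more likely hateful".
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative example")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(pos) * len(neg))


# Hypothetical toy data, not taken from the paper.
toy_labels = [1, 0, 1, 0, 1, 0]
toy_scores = [0.9, 0.2, 0.6, 0.7, 0.8, 0.1]
toy_auroc = auroc(toy_labels, toy_scores)  # 8 of 9 pairs are ranked correctly
```

The quadratic pairwise loop is chosen for readability. For a full dataset, a rank-based implementation or a library routine would be the usual choice.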
Abstract
Moderating social media content is currently a highly manual task, yet too much content is posted daily for manual review to be effective. With the advent of a number of multimodal models, there is the potential to reduce the amount of manual labor this task requires. In this work, we explore different models and determine which is most effective for the Hateful Memes Challenge, a challenge by Meta designed to further machine learning research in content moderation. Specifically, we explore the differences between early fusion and late fusion models in classifying multimodal memes containing text and images. We first implement unimodal baselines for text and images separately, using BERT and ResNet-152, respectively. The outputs from these unimodal models are then concatenated to create a late fusion model. In terms of early fusion models, we implement ConcatBERT,…
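The late fusion step described in the abstract, concatenating the outputs of the text and image models before classification, can be sketched as follows. This is a simplified illustration and not the authors' implementation. The feature vectors, their sizes, the weights, and the single linear layer with a sigmoid are all assumptions. In the paper, the inputs would come from BERT and ResNet-152 and the classifier would be trained.

```python
import math


def late_fusion_logit(text_features, image_features, weights, bias):
    """Concatenate per-modality outputs, then apply one linear layer."""
    fused = list(text_features) + list(image_features)  # the fusion step
    if len(fused) != len(weights):
        raise ValueError("weights must match the concatenated feature length")
    return sum(w * x for w, x in zip(weights, fused)) + bias


def sigmoid(z):
    """Map a logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical stand-ins for the text-model and image-model outputs.
text_features = [0.2, -0.1, 0.4]
image_features = [0.5, 0.3]

# Hypothetical classifier parameters; a real model would learn these.
weights = [1.0, 0.0, -1.0, 0.5, 0.0]
bias = 0.0

logit = late_fusion_logit(text_features, image_features, weights, bias)
prob_hateful = sigmoid(logit)
```

In this arrangement each modality is encoded independently and the two only interact in the final classifier, which is the property that distinguishes late fusion from the early fusion models the abstract goes on to list.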
Taxonomy
Topics: Hate Speech and Cyberbullying Detection · Misinformation and Its Impacts
Methods: Attention Is All You Need · Linear Warmup With Linear Decay · Softmax · Layer Normalization · Linear Layer · WordPiece · Dropout · Weight Decay · Multi-Head Attention
