TL;DR
This paper investigates hate speech detection in multimodal social media posts combining text and images, introducing a large dataset and analyzing the effectiveness of joint models versus unimodal approaches.
Contribution
It presents MMHS150K, a large annotated Twitter dataset for multimodal hate speech detection, and compares multimodal models with unimodal ones to expose the limitations of current joint models.
Findings
Images carry useful signal for hate speech detection
Current multimodal models still do not outperform text-only models
The paper analyzes the challenges of the task and opens the dataset for further research
Abstract
In this work we target the problem of hate speech detection in multimodal publications formed by a text and an image. We gather and annotate a large-scale dataset from Twitter, MMHS150K, and propose different models that jointly analyze textual and visual information for hate speech detection, comparing them with unimodal detection. We provide quantitative and qualitative results and analyze the challenges of the proposed task. We find that, even though images are useful for the hate speech detection task, current multimodal models cannot outperform models analyzing only text. We discuss why and open the field and the dataset for further research.
