An Interpretable Approach to Hateful Meme Detection
Tanvi Deshpande, Nitya Mani

TL;DR
This paper presents an interpretable machine learning approach for hateful meme detection, achieving performance comparable to state-of-the-art models while emphasizing feature importance and transparency.
Contribution
It introduces a gradient-boosted decision tree and an LSTM-based model, both built with a focus on interpretability, and uses them to identify the features most important to classifying a meme as hateful.
Findings
Models achieve 73.8 validation auROC
Models achieve 72.7 test auROC
Performance is comparable to human performance and to state-of-the-art transformer models
Abstract
Hateful memes are an emerging method of spreading hate on the internet, relying on both images and text to convey a hateful message. We take an interpretable approach to hateful meme detection, using machine learning and simple heuristics to identify the features most important to classifying a meme as hateful. In the process, we build a gradient-boosted decision tree and an LSTM-based model that achieve comparable performance (73.8 validation and 72.7 test auROC) to the gold standard of humans and state-of-the-art transformer models on this challenging task.
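For readers unfamiliar with the metric quoted above, the following is a minimal sketch of how auROC can be computed, assuming binary labels (1 = hateful, 0 = not hateful) and real-valued model scores. It uses the pairwise (Mann-Whitney) formulation: the fraction of (hateful, non-hateful) pairs in which the hateful example receives the higher score, with ties counted as half. The function name `auroc`, the label encoding, and the toy data are illustrative assumptions, not taken from the paper; the reported 73.8 and 72.7 appear to be this quantity on a 0 to 100 scale.

```python
def auroc(labels, scores):
    """Area under the ROC curve via pairwise comparison (Mann-Whitney).

    labels: sequence of 0/1 values (1 = hateful; an assumed encoding)
    scores: sequence of model scores, higher = more likely hateful
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("auROC needs at least one positive and one negative example")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive correctly ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


# Toy, made-up labels and scores, purely for illustration.
labels = [1, 0, 1, 0, 1, 0]
scores = [0.9, 0.2, 0.6, 0.7, 0.8, 0.1]
print(round(100 * auroc(labels, scores), 1))  # 8 of 9 pairs ranked correctly -> 88.9
```

A score of 50 on this scale corresponds to random ranking and 100 to a perfect ranking. The quadratic loop is kept for readability; a sort-based version would be preferable on large evaluation sets.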
