An Interpretable Approach to Hateful Meme Detection

Tanvi Deshpande; Nitya Mani

arXiv:2108.10069·cs.LG·August 24, 2021

TL;DR

This paper presents an interpretable machine learning approach for hateful meme detection, achieving performance comparable to state-of-the-art models while emphasizing feature importance and transparency.

Contribution

It introduces a gradient-boosted decision tree and an LSTM-based model, both built with a focus on interpretability, and uses them to identify the features most important to classifying a meme as hateful.

Findings

01. Models achieve 73.8 validation auROC

02. Models achieve 72.7 test auROC

03. Performance comparable to human and transformer models

Abstract

Hateful memes are an emerging method of spreading hate on the internet, relying on both images and text to convey a hateful message. We take an interpretable approach to hateful meme detection, using machine learning and simple heuristics to identify the features most important to classifying a meme as hateful. In the process, we build a gradient-boosted decision tree and an LSTM-based model that achieve comparable performance (73.8 validation and 72.7 test auROC) to the gold standard of humans and state-of-the-art transformer models on this challenging task.
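
For readers unfamiliar with the metric, auROC (area under the ROC curve) can be read as the probability that a randomly chosen hateful meme receives a higher score than a randomly chosen non-hateful one. The following is a minimal sketch of that pairwise definition, not the authors' evaluation code; the function name `auroc` and the toy labels and scores are made up for illustration. It returns a value in [0, 1], whereas the figures above (73.8 and 72.7) appear to be expressed on a 0 to 100 scale.

```python
def auroc(labels, scores):
    """Area under the ROC curve via pairwise comparison (Mann-Whitney form).

    labels: iterable of 0/1 ground-truth values (1 = hateful, by assumption).
    scores: iterable of model scores, higher meaning "more likely hateful".
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("auROC needs at least one positive and one negative example")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(pos) * len(neg))


# Hypothetical toy data, not taken from the paper.
labels = [1, 0, 1, 0, 1, 0]
scores = [0.9, 0.2, 0.6, 0.7, 0.8, 0.1]
print(auroc(labels, scores))
```

The double loop is quadratic in the number of examples, which is fine for illustration; a rank-based computation would be the usual choice for a full dataset.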
