OSPC: Detecting Harmful Memes with Large Language Model as a Catalyst
Jingtao Cao, Zheng Zhang, Hongru Wang, Bin Liang, Hao Wang, Kam-Fai Wong

TL;DR
This paper introduces a comprehensive system combining image captioning, OCR, and large language models to detect harmful memes across multiple languages, achieving state-of-the-art performance in a Singaporean context.
Contribution
The study presents a multimodal framework that combines image captioning (BLIP), OCR (PP-OCR and TrOCR), and an LLM (Qwen), fine-tuned on additional data labeled with GPT-4V, to detect harmful memes in multilingual settings.
Findings
Achieved top-1 at AI Singapore's Online Safety Prize Challenge
Outperformed previous benchmarks like FLAVA and VisualBERT
System detects harmful content in memes written in four languages: English, Chinese, Malay, and Tamil
Abstract
Memes, which rapidly disseminate personal opinions and positions across the internet, also pose significant challenges in propagating social bias and prejudice. This study presents a novel approach to detecting harmful memes, particularly within the multicultural and multilingual context of Singapore. Our methodology integrates image captioning, Optical Character Recognition (OCR), and Large Language Model (LLM) analysis to comprehensively understand and classify harmful memes. Utilizing the BLIP model for image captioning, PP-OCR and TrOCR for text recognition across multiple languages, and the Qwen LLM for nuanced language understanding, our system is capable of identifying harmful content in memes created in English, Chinese, Malay, and Tamil. To enhance the system's performance, we fine-tuned our approach by leveraging additional data labeled using GPT-4V, aiming to distill the…
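The three stages named in the abstract (captioning, OCR, LLM analysis) can be sketched as a simple pipeline. This is a minimal illustration, not the authors' implementation: the names `classify_meme`, `build_prompt`, and `MemeVerdict`, the prompt wording, the way outputs from several OCR engines are merged, and the yes/no answer parsing are all assumptions made for the example. Each stage is a plain callable so that real models such as BLIP, PP-OCR, TrOCR, or Qwen could be wrapped behind it.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

# Hypothetical stage interfaces (assumptions, not from the paper):
# each takes raw image bytes or a prompt string and returns text.
Captioner = Callable[[bytes], str]
OcrEngine = Callable[[bytes], str]
Llm = Callable[[str], str]


@dataclass
class MemeVerdict:
    caption: str    # description of the image content
    ocr_text: str   # text recognized inside the meme
    harmful: bool   # final decision parsed from the LLM answer


def build_prompt(caption: str, ocr_text: str) -> str:
    """Combine the image description and embedded text into one prompt."""
    return (
        "Image description: " + caption + "\n"
        "Text in the meme: " + ocr_text + "\n"
        "Is this meme harmful? Answer 'yes' or 'no'."
    )


def classify_meme(
    image: bytes,
    captioner: Captioner,
    ocr_engines: Sequence[OcrEngine],
    llm: Llm,
) -> MemeVerdict:
    caption = captioner(image)

    # Run every OCR engine and keep non-empty, de-duplicated outputs.
    # How the real system merges its OCR results is not stated in the source.
    texts: List[str] = []
    for engine in ocr_engines:
        text = engine(image).strip()
        if text and text not in texts:
            texts.append(text)
    ocr_text = " | ".join(texts)

    # Assumed convention: the LLM's answer starts with "yes" for harmful memes.
    answer = llm(build_prompt(caption, ocr_text))
    harmful = answer.strip().lower().startswith("yes")
    return MemeVerdict(caption, ocr_text, harmful)


if __name__ == "__main__":
    # Stub models stand in for the real captioner, OCR engines, and LLM.
    verdict = classify_meme(
        image=b"",
        captioner=lambda img: "a cartoon cat at a desk",
        ocr_engines=[lambda img: "monday again", lambda img: ""],
        llm=lambda prompt: "no",
    )
    print(verdict)
```

Keeping the stages behind plain callables means the text-only LLM never sees the image directly; it only receives the caption and the recognized text, which is the division of labor the abstract describes.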
Taxonomy
Methods: Attention Is All You Need · Dense Connections · Residual Connection · Softmax · Layer Normalization · Linear Layer · Multi-Head Attention · Position-Wise Feed-Forward Layer · VisualBERT · BLIP: Bootstrapping Language-Image Pre-training
