Large Vision-Language Models for Knowledge-Grounded Data Annotation of Memes
Shiling Deng, Serge Belongie, Peter Ebert Christensen

TL;DR
This paper introduces a large meme dataset, an automated annotation pipeline using vision-language models, and a specialized CLIP model for meme-text retrieval to improve understanding and analysis of memes.
Contribution
It presents a new large-scale meme dataset, an automated annotation framework, and a fine-tuned CLIP model for meme-text retrieval, advancing meme analysis capabilities.
Findings
Created the CM50 meme dataset with over 33,000 memes centered around 50 popular meme templates
Developed an automated annotation pipeline for memes
Enhanced meme-text retrieval with a specialized CLIP model
Abstract
Memes have emerged as a powerful form of communication, integrating visual and textual elements to convey humor, satire, and cultural messages. Existing research has focused primarily on aspects such as emotion classification, meme generation, propagation, interpretation, figurative language, and sociolinguistics, but has often overlooked deeper meme comprehension and meme-text retrieval. To address these gaps, this study introduces ClassicMemes-50-templates (CM50), a large-scale dataset consisting of over 33,000 memes, centered around 50 popular meme templates. We also present an automated knowledge-grounded annotation pipeline leveraging large vision-language models to produce high-quality image captions, meme captions, and literary device labels, overcoming the labor-intensive demands of manual annotation. Additionally, we propose a meme-text retrieval CLIP model (mtrCLIP) that…
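As an illustration of what meme-text retrieval with a CLIP-style model typically involves, the following is a minimal sketch, not the authors' mtrCLIP implementation. It assumes that a text query and each meme have already been encoded into vectors in a shared embedding space, and it ranks memes by cosine similarity to the query. The function names, meme IDs, and toy embedding values are all hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_memes(query_embedding, meme_embeddings, top_k=3):
    """Return the top_k (meme_id, score) pairs, highest similarity first."""
    scored = [
        (meme_id, cosine_similarity(query_embedding, embedding))
        for meme_id, embedding in meme_embeddings.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy 3-dimensional embeddings (hypothetical values). A real CLIP-style
# encoder would produce much higher-dimensional vectors.
meme_embeddings = {
    "meme_a": [0.9, 0.1, 0.0],
    "meme_b": [0.1, 0.8, 0.1],
    "meme_c": [0.0, 0.2, 0.9],
}
query_embedding = [0.8, 0.2, 0.1]  # stands in for an encoded text query

results = rank_memes(query_embedding, meme_embeddings, top_k=2)
for meme_id, score in results:
    print(meme_id, round(score, 3))
```

In this toy setup the query vector points mostly along the first dimension, so meme_a ranks first. In practice the embeddings would come from the image and text encoders of the retrieval model rather than being written by hand.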
Taxonomy
Topics: Sentiment Analysis and Opinion Mining · Misinformation and Its Impacts
Methods: Contrastive Language-Image Pre-training
