On VLMs for Diverse Tasks in Multimodal Meme Classification
Deepesh Gavit, Debajyoti Mazumder, Samiran Das, Jasabanta Patro

TL;DR
This paper systematically analyzes vision-language models (VLMs) for meme classification and introduces an approach that combines VLMs and LLMs to improve performance on sarcasm, offensive, and sentiment classification of memes.
Contribution
It presents a new method that uses detailed meme interpretations from VLMs to train smaller LLMs, enhancing classification performance.
Findings
Combining VLMs with LLMs improved baseline performance by 8.34% (sarcasm), 3.52% (offensive), and 26.24% (sentiment) classification.
Benchmarking of VLMs with diverse prompting strategies.
Assessment of LoRA fine-tuning across VLM components.
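To make the LoRA finding concrete, here is a minimal sketch of why LoRA is cheap to apply per component. LoRA freezes a weight matrix of shape d_out x d_in and trains two low-rank factors of shapes d_out x r and r x d_in. The layer size and rank below are illustrative assumptions, not values reported in the paper.

```python
def full_params(d_out: int, d_in: int) -> int:
    """Number of weights in a dense d_out x d_in matrix."""
    return d_out * d_in


def lora_trainable_params(d_out: int, d_in: int, rank: int) -> int:
    """Trainable weights when the update is factored as B (d_out x r) times A (r x d_in)."""
    return rank * (d_out + d_in)


if __name__ == "__main__":
    # Hypothetical projection layer and rank, chosen only for illustration.
    d_out, d_in, rank = 4096, 4096, 8
    full = full_params(d_out, d_in)
    lora = lora_trainable_params(d_out, d_in, rank)
    print(f"full: {full}, LoRA: {lora}, fraction: {lora / full:.6f}")
```

Because the cost scales with the rank rather than with the full matrix size, adapters can be attached to different parts of a VLM and compared at modest cost.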
Abstract
In this paper, we present a comprehensive and systematic analysis of vision-language models (VLMs) for disparate meme classification tasks. We introduce a novel approach that generates a VLM-based understanding of meme images and fine-tunes LLMs on the textual understanding of the embedded meme text to improve performance. Our contributions are threefold: (1) benchmarking VLMs with diverse prompting strategies tailored to each sub-task; (2) evaluating LoRA fine-tuning across all VLM components to assess performance gains; and (3) proposing a novel approach in which detailed meme interpretations generated by VLMs are used to train smaller language models (LLMs), significantly improving classification. The strategy of combining VLMs with LLMs improved the baseline performance by 8.34%, 3.52% and 26.24% for sarcasm, offensive and sentiment classification, respectively. Our results…
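The two-stage idea in the abstract (a VLM writes an interpretation of each meme, and a smaller language model is then trained on text) can be sketched as a data-preparation step. This is a minimal sketch under stated assumptions: the `Meme` record, the `build_training_text` template, and the `fake_vlm` stand-in are all hypothetical names introduced here for illustration, and the paper's actual prompts and input format are not given in this summary.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Meme:
    image_path: str      # path to the meme image
    embedded_text: str   # text overlaid on the meme (e.g. from OCR)
    label: str           # task label, e.g. "sarcastic" or "not_sarcastic"


def build_training_text(embedded_text: str, interpretation: str) -> str:
    """Join the meme's own text with the VLM interpretation (assumed template)."""
    return (
        f"Meme text: {embedded_text.strip()}\n"
        f"Interpretation: {interpretation.strip()}"
    )


def build_dataset(
    memes: List[Meme],
    interpret: Callable[[str, str], str],
) -> List[Dict[str, str]]:
    """Stage 1: ask a VLM for an interpretation; emit text/label rows for stage 2."""
    rows = []
    for meme in memes:
        interpretation = interpret(meme.image_path, meme.embedded_text)
        rows.append(
            {
                "text": build_training_text(meme.embedded_text, interpretation),
                "label": meme.label,
            }
        )
    return rows


def fake_vlm(image_path: str, embedded_text: str) -> str:
    """Placeholder for a real VLM call; returns a canned string."""
    return f"placeholder interpretation of {image_path}"


if __name__ == "__main__":
    data = build_dataset([Meme("a.png", " oh great, Monday ", "sarcastic")], fake_vlm)
    print(data[0]["text"])
    print(data[0]["label"])
```

The resulting rows are plain text with labels, so the second stage could be any text classifier fine-tuning routine. Passing the VLM in as a callable keeps the sketch independent of any particular model or library.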
Taxonomy
Topics: Misinformation and Its Impacts · Sentiment Analysis and Opinion Mining · Advanced Malware Detection Techniques
