XMeCap: Meme Caption Generation with Sub-Image Adaptability
Yuyan Chen, Songzhou Yan, Zhihong Zhu, Zhixu Li, Yanghua Xiao

TL;DR
This paper introduces XMeCap, a meme captioning framework that handles multi-image memes by combining supervised fine-tuning with reinforcement learning driven by a reward model, improving caption quality for both single-image and multi-image memes and across meme categories.
Contribution
The paper presents XMeCap, a multi-modal meme captioning model whose reward model factors in both global and local similarities between visuals and text. The authors report that it outperforms contemporary models on single-image and multi-image memes.
Findings
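XMeCap scores higher than the contemporary baseline models, with a reported average evaluation score of 75.85 on single-image memes and 66.32 in the multi-image setting.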
The framework improves captioning for both single and multi-image memes.
Results demonstrate enhanced understanding of humor in multi-modal contexts.
Abstract
Humor, deeply rooted in societal meanings and cultural details, poses a unique challenge for machines. While advances have been made in natural language processing, real-world humor often thrives in a multi-modal context, encapsulated distinctively by memes. This paper places particular emphasis on the impact of multiple images on meme captioning. We then introduce the XMeCap framework, a novel approach that adopts supervised fine-tuning and reinforcement learning based on an innovative reward model, which factors in both global and local similarities between visuals and text. Our results, benchmarked against contemporary models, show a marked improvement in caption generation for both single-image and multi-image memes, as well as across different meme categories. XMeCap achieves an average evaluation score of 75.85 for single-image memes and 66.32 for multi-image…
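The abstract describes a reward model that factors in both global and local similarities between visuals and text, but this summary does not give its formula. The following is only a minimal sketch of how such a reward could be combined, under stated assumptions: the caption and each sub-image are already embedded as plain float vectors, "global" is taken to mean similarity to the averaged sub-image embedding, "local" is taken to mean the average per-sub-image similarity, and the weight `alpha` and the function name `caption_reward` are hypothetical. It is not the authors' implementation.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def mean_vector(vectors: Sequence[Vector]) -> List[float]:
    """Element-wise mean of a non-empty list of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def caption_reward(
    caption_vec: Vector,
    sub_image_vecs: Sequence[Vector],
    alpha: float = 0.5,  # hypothetical weight between global and local terms
) -> float:
    """Illustrative reward: weighted mix of a global and a local similarity term.

    Assumption: global = caption vs. the mean of all sub-image embeddings,
    local = mean of caption vs. each individual sub-image embedding.
    """
    if not sub_image_vecs:
        raise ValueError("at least one sub-image embedding is required")
    global_sim = cosine(caption_vec, mean_vector(sub_image_vecs))
    local_sim = sum(cosine(caption_vec, v) for v in sub_image_vecs) / len(sub_image_vecs)
    return alpha * global_sim + (1.0 - alpha) * local_sim
```

With a single sub-image the two terms coincide, so the sketch reduces to plain cosine similarity. With several sub-images, a caption that matches only one panel scores lower on the local term than one that relates to all of them.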
