Culture-Aware Humorous Captioning: Multimodal Humor Generation across Cultural Contexts
Run Xu, Lu Li, Rongzhao Zhang, Jie Xu

TL;DR
This paper introduces a new multimodal task, generating culture-aware humorous image captions, and addresses the challenge of maintaining image relevance, contextual appropriateness, and humor across different cultural contexts.
Contribution
It proposes a staged alignment framework with preference alignment and cultural adaptation, improving humor quality and contextual fit in cross-cultural captioning.
Findings
The proposed model achieves better performance in contextual fit and humor quality.
It shows large gains in cultural relevance and in balancing image relevance with humor.
The framework adapts effectively to Eastern cultural contexts with minimal supervision.
Abstract
Recent multimodal large language models have shown promising ability in generating humorous captions for images, yet they still lack stable control over explicit cultural context, making it difficult to jointly maintain image relevance, contextual appropriateness, and humor quality under a specified cultural background. To address this limitation, we introduce a new multimodal generation task, culture-aware humorous captioning, which requires a model to generate a humorous caption conditioned on both an input image and a target cultural context. Captions generated under different cultural contexts are not expected to share the same surface form, but should remain grounded in similar visual situations or humorous rationales. To support this task, we establish a six-dimensional evaluation framework covering image relevance, contextual fit, semantic richness, reasonableness, humor, and…
