MMSD3.0: A Multi-Image Benchmark for Real-World Multimodal Sarcasm Detection

Haochen Zhao; Yuyao Kong; Yongxiu Xu; Gaopeng Gou; Hongbo Xu; Yubin Wang; Haoliang Zhang

arXiv:2510.23299 · cs.CV · March 2, 2026

TL;DR

This paper introduces MMSD3.0, a multi-image sarcasm detection benchmark, along with CIRM, a model that captures inter-image relations, improving sarcasm detection in real-world multi-image scenarios.

Contribution

The paper presents MMSD3.0, a multi-image sarcasm dataset curated from tweets and Amazon reviews, and proposes CIRM, a cross-image reasoning model with a relevance-guided, fine-grained cross-modal fusion mechanism.

Findings

01

CIRM achieves state-of-the-art results on MMSD, MMSD2.0, and MMSD3.0.

02

MMSD3.0 better reflects real-world multi-image sarcasm scenarios.

03

The proposed model effectively captures inter-image relations.

Abstract

Despite progress in multimodal sarcasm detection, existing datasets and methods predominantly focus on single-image scenarios, overlooking potential semantic and affective relations across multiple images. This leaves a gap in modeling cases where sarcasm is triggered by multi-image cues in real-world settings. To bridge this gap, we introduce MMSD3.0, a new benchmark composed entirely of multi-image samples curated from tweets and Amazon reviews. We further propose the Cross-Image Reasoning Model (CIRM), which performs targeted cross-image sequence modeling to capture latent inter-image connections. In addition, we introduce a relevance-guided, fine-grained cross-modal fusion mechanism based on text-image correspondence to reduce information loss during integration. We establish a comprehensive suite of strong and representative baselines and conduct extensive experiments, showing that…
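The idea of relevance-guided fusion can be illustrated with a minimal sketch. This is not the authors' CIRM implementation, whose details the abstract does not give. It assumes each tweet or review has already been encoded into one text vector and one vector per image, and it scores each image by cosine similarity to the text before taking a softmax-weighted sum. The function names (`cosine`, `relevance_weights`, `fuse_images`) and the `temperature` parameter are hypothetical, chosen only for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def relevance_weights(text_vec, image_vecs, temperature=1.0):
    """Softmax over text-image cosine similarities (assumed relevance score)."""
    scores = [cosine(text_vec, img) / temperature for img in image_vecs]
    max_score = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_images(text_vec, image_vecs, temperature=1.0):
    """Return (fused image vector, weights): a relevance-weighted sum of image vectors."""
    weights = relevance_weights(text_vec, image_vecs, temperature)
    dim = len(image_vecs[0])
    fused = [
        sum(w * img[d] for w, img in zip(weights, image_vecs))
        for d in range(dim)
    ]
    return fused, weights


# Toy example: the first image aligns with the text, the second is orthogonal to it.
text = [1.0, 0.0]
images = [[1.0, 0.0], [0.0, 1.0]]
fused, weights = fuse_images(text, images)
```

In this toy case the image aligned with the text receives the larger weight, so it dominates the fused representation. A lower `temperature` sharpens that preference, and a higher one moves the weights toward a uniform average.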
