GDCNet: Generative Discrepancy Comparison Network for Multimodal Sarcasm Detection
Shuguang Zhang, Junhong Lian, Guoxin Yu, Baoxun Xu, Xiang Ao

TL;DR
GDCNet is a framework for multimodal sarcasm detection that uses factually grounded image captions from Multimodal LLMs (MLLMs) to measure semantic and sentiment discrepancies between an image's objective description and its accompanying text, improving accuracy and robustness.
Contribution
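The paper introduces GDCNet, which uses grounded image captions as stable semantic anchors for detecting sarcasm in image-text pairs. This addresses two limitations of previous methods: embedding-misalignment approaches struggle when the image and text are only loosely related, and LLM-generated sarcastic cues tend to be diverse, subjective, and noisy.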
Findings
Achieves state-of-the-art accuracy on MSD benchmarks.
Effectively captures semantic and sentiment discrepancies.
Demonstrates robustness in diverse multimodal scenarios.
Abstract
Multimodal sarcasm detection (MSD) aims to identify sarcasm within image-text pairs by modeling semantic incongruities across modalities. Existing methods often exploit cross-modal embedding misalignment to detect inconsistency but struggle when visual and textual content are loosely related or semantically indirect. While recent approaches leverage large language models (LLMs) to generate sarcastic cues, the inherent diversity and subjectivity of these generations often introduce noise. To address these limitations, we propose the Generative Discrepancy Comparison Network (GDCNet). This framework captures cross-modal conflicts by utilizing descriptive, factually grounded image captions generated by Multimodal LLMs (MLLMs) as stable semantic anchors. Specifically, GDCNet computes semantic and sentiment discrepancies between the generated objective description and the original text,…
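The abstract describes comparing the original text against an objective MLLM-generated caption along two axes, semantic and sentiment. The Python sketch below illustrates that idea under stated assumptions: the text and caption are assumed to be already encoded into embeddings and scored for polarity in [-1, 1] by some encoder and sentiment model (both hypothetical, not GDCNet's components). Semantic discrepancy is taken as one minus cosine similarity and sentiment discrepancy as the absolute polarity gap. These formulas are illustrative choices, not the paper's definitions, which the truncated abstract does not state.

```python
import math
from dataclasses import dataclass
from typing import Sequence


@dataclass
class DiscrepancyFeatures:
    semantic: float   # 1 - cosine similarity of text and caption embeddings
    sentiment: float  # absolute gap between text and caption polarity scores


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def discrepancy(
    text_emb: Sequence[float],
    caption_emb: Sequence[float],
    text_polarity: float,
    caption_polarity: float,
) -> DiscrepancyFeatures:
    """Hypothetical discrepancy features between a post's text and an MLLM caption.

    ASSUMPTION: embeddings come from some sentence encoder and polarities lie
    in [-1, 1] (negative to positive); neither is specified by the abstract.
    """
    semantic = 1.0 - cosine_similarity(text_emb, caption_emb)
    sentiment = abs(text_polarity - caption_polarity)
    return DiscrepancyFeatures(semantic=semantic, sentiment=sentiment)


if __name__ == "__main__":
    # Toy example: upbeat text ("What a lovely day!") paired with a caption
    # describing a gloomy scene; the embeddings are orthogonal toy vectors.
    features = discrepancy([1.0, 0.0], [0.0, 1.0], 0.75, -0.5)
    print(features)  # DiscrepancyFeatures(semantic=1.0, sentiment=1.25)
```

A large value on either axis signals the kind of text-versus-reality conflict that sarcasm often exploits. In a full model, such scores would plausibly be combined with the embeddings and passed to a classifier; how GDCNet fuses them is not described in the excerpt above.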
Taxonomy
Topics: Sentiment Analysis and Opinion Mining · Multimodal Machine Learning Applications · Topic Modeling
