TL;DR
InterCLIP-MEP is a novel multi-modal sarcasm detection framework that combines efficient cross-modal representations with a memory-augmented predictor, achieving state-of-the-art results with fewer trainable parameters and improved robustness.
Contribution
The paper introduces InterCLIP-MEP, which integrates Interactive CLIP (InterCLIP) for enriched cross-modal features with a Memory-Enhanced Predictor (MEP) for sarcasm detection, while using significantly fewer trainable parameters than existing state-of-the-art methods.
Findings
Achieves state-of-the-art performance on MMSD, MMSD2.0, and DocMSU datasets.
Improves accuracy by 1.08% and F1 score by 1.51% on MMSD2.0.
Outperforms previous methods under distributional shift, with nearly 10% higher accuracy.
Abstract
Sarcasm in social media, frequently conveyed through the interplay of text and images, presents significant challenges for sentiment analysis and intention mining. Existing multi-modal sarcasm detection approaches have been shown to excessively depend on superficial cues within the textual modality, exhibiting limited capability to accurately discern sarcasm through subtle text-image interactions. To address this limitation, a novel framework, InterCLIP-MEP, is proposed. This framework integrates Interactive CLIP (InterCLIP), which employs an efficient training strategy to derive enriched cross-modal representations by embedding inter-modal information directly into each encoder, while using approximately 20.6× fewer trainable parameters compared with existing state-of-the-art (SOTA) methods. Furthermore, a Memory-Enhanced Predictor (MEP) is introduced, featuring a dynamic…
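The abstract describes InterCLIP as embedding inter-modal information directly into each encoder. One plausible way to picture that idea is an attention layer in which one modality's tokens attend over their own tokens plus adapted tokens from the other modality. The sketch below is an illustration under that assumption, not the paper's confirmed mechanism: the function names (`project`, `conditioned_attention`), the `adapter` matrix, and the toy dimensions are all hypothetical, and the usual query/key/value projections are left out for brevity.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def project(tokens, weight):
    """Apply a linear map (weight: out_dim x in_dim) to each token vector."""
    return [[sum(w * x for w, x in zip(row, tok)) for row in weight] for tok in tokens]


def conditioned_attention(own_tokens, other_tokens, adapter):
    """Single-head attention: queries come from one modality; keys and values
    are that modality's tokens plus adapted tokens from the other modality.

    Hypothetical illustration only (Q/K/V projections omitted, i.e. identity).
    """
    dim = len(own_tokens[0])
    # Extend the attention context with the other modality, mapped into our space.
    context = own_tokens + project(other_tokens, adapter)
    outputs, weights = [], []
    for q in own_tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in context]
        attn = softmax(scores)
        out = [sum(a * v[i] for a, v in zip(attn, context)) for i in range(dim)]
        outputs.append(out)
        weights.append(attn)
    return outputs, weights


random.seed(0)
text_dim, image_dim = 4, 6  # toy sizes, chosen only for the example
text_tokens = [[random.gauss(0, 1) for _ in range(text_dim)] for _ in range(3)]
image_tokens = [[random.gauss(0, 1) for _ in range(image_dim)] for _ in range(2)]
# Small adapter mapping image space into text space (text_dim x image_dim).
adapter = [[random.gauss(0, 0.1) for _ in range(image_dim)] for _ in range(text_dim)]

fused_text, attn_weights = conditioned_attention(text_tokens, image_tokens, adapter)
```

In a setup like this, only the small adapter would need training while the pretrained encoder weights stay frozen, which is one way a method could keep its trainable parameter count low; whether InterCLIP does exactly this is not stated in the text above.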
Taxonomy
Methods: Contrastive Language-Image Pre-training
