TL;DR
This paper introduces a novel multimodal sarcasm detection model that combines RoBERTa with co-attention and FiLMed ResNet blocks, significantly improving accuracy on Twitter datasets by leveraging text-image incongruity.
Contribution
The paper proposes a new architecture integrating co-attention and FiLMed ResNet to effectively fuse multimodal data for sarcasm detection, outperforming existing methods.
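To make the co-attention idea concrete, here is a minimal NumPy sketch of a generic bilinear co-attention between text token features and image attribute features. This summary does not give the paper's exact formulation, so the function name `co_attention`, the bilinear weight matrix `w`, and all shapes are assumptions chosen for illustration only.

```python
import numpy as np

def softmax(z, axis):
    # Numerically stable softmax along the given axis.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def co_attention(text, attrs, w):
    """Generic bilinear co-attention (illustrative, not the paper's exact layer).

    text:  (T, d) token features, e.g. from a text encoder
    attrs: (A, d) image attribute features
    w:     (d, d) hypothetical learned bilinear weights
    """
    affinity = text @ w @ attrs.T              # (T, A) token-attribute scores
    text_to_attr = softmax(affinity, axis=1)   # each token attends over attributes
    attr_to_text = softmax(affinity, axis=0)   # each attribute attends over tokens
    text_ctx = text_to_attr @ attrs            # (T, d) attribute context per token
    attr_ctx = attr_to_text.T @ text           # (A, d) token context per attribute
    return text_ctx, attr_ctx

rng = np.random.default_rng(0)
T, A, d = 5, 3, 8
text = rng.normal(size=(T, d))
attrs = rng.normal(size=(A, d))
w = rng.normal(size=(d, d))
text_ctx, attr_ctx = co_attention(text, attrs, w)
```

Each modality is summarized from the point of view of the other, which is what lets a downstream classifier compare what the text says with what the image shows.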
Findings
Achieves an F1 score 6.14% higher than the previous state of the art.
Effectively models context incongruity between text and images.
Demonstrates the benefit of multimodal fusion in sarcasm detection.
Abstract
Sarcasm detection identifies natural language expressions whose intended meaning differs from what their surface meaning implies. It has applications in many NLP tasks, such as opinion mining and sentiment analysis. Today, social media has given rise to abundant multimodal data in which users express their opinions through text and images. Our paper aims to leverage multimodal data to improve the performance of existing systems for sarcasm detection. So far, various approaches have been proposed that use the text modality, the image modality, or a fusion of both. We propose a novel architecture that uses the RoBERTa model with a co-attention layer on top to incorporate context incongruity between input text and image attributes. Further, we integrate feature-wise affine transformation by conditioning the input image through FiLMed ResNet blocks with the textual features…
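The feature-wise affine transformation mentioned in the abstract can be sketched as follows. This is a minimal NumPy illustration of FiLM-style conditioning, in which per-channel scale and shift values are predicted from the text features and applied to an image feature map. The function name `film`, the projection matrices `w_gamma` and `w_beta`, and the shapes are hypothetical; the placement of this operation inside the paper's ResNet blocks is not reproduced here.

```python
import numpy as np

def film(feature_map, text_vec, w_gamma, w_beta):
    """Feature-wise linear modulation of image features by text features.

    feature_map: (C, H, W) image feature map
    text_vec:    (D,) conditioning text features
    w_gamma:     (C, D) hypothetical projection producing per-channel scales
    w_beta:      (C, D) hypothetical projection producing per-channel shifts
    """
    gamma = w_gamma @ text_vec   # (C,) one scale per channel
    beta = w_beta @ text_vec     # (C,) one shift per channel
    # Broadcast the per-channel scale and shift over the spatial dimensions.
    return gamma[:, None, None] * feature_map + beta[:, None, None]

rng = np.random.default_rng(0)
C, H, W, D = 4, 3, 3, 8
x = rng.normal(size=(C, H, W))
t = rng.normal(size=D)
wg = rng.normal(size=(C, D))
wb = rng.normal(size=(C, D))
out = film(x, t, wg, wb)
```

Because the scale and shift depend only on the text, the same image can yield different modulated features under different captions, which is one way to expose text-image incongruity to later layers.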
Taxonomy
Methods: Attention Is All You Need · Linear Layer · WordPiece · Adam · Attention Dropout · Weight Decay · Dropout · Dense Connections · Softmax · Linear Warmup With Linear Decay
