An Enhanced Dual Transformer Contrastive Network for Multimodal Sentiment Analysis
Phuong Q. Dao, Mark Roantree, and Vuong M. Ngo

TL;DR
This paper introduces a novel multimodal sentiment analysis model that combines Transformer encoders for text and images with contrastive learning to improve cross-modal representation and sentiment classification accuracy.
Contribution
The paper proposes BERT-ViT-EF, an early-fusion model, and extends it into DTCN, which adds a Transformer encoder layer after BERT and uses contrastive learning to align text and image representations for multimodal sentiment analysis.
Findings
DTCN achieves 78.4% accuracy on TumEmo.
The model outperforms previous methods on benchmark datasets.
Early fusion and contrastive learning improve multimodal representation quality.
Abstract
Multimodal Sentiment Analysis (MSA) seeks to understand human emotions by jointly analyzing data from multiple modalities, typically text and images, offering a richer and more accurate interpretation than unimodal approaches. In this paper, we first propose BERT-ViT-EF, a novel model that combines powerful Transformer-based encoders, BERT for textual input and ViT for visual input, through an early fusion strategy. This approach facilitates deeper cross-modal interactions and more effective joint representation learning. To further enhance the model's capability, we propose an extension called the Dual Transformer Contrastive Network (DTCN), which builds upon BERT-ViT-EF. DTCN incorporates an additional Transformer encoder layer after BERT to refine textual context (before fusion) and employs contrastive learning to align text and image representations, fostering robust multimodal feature…
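The abstract says DTCN uses contrastive learning to align text and image representations but does not give the loss. As a rough illustration of the general idea only, the sketch below implements a symmetric InfoNCE-style loss in plain Python: each text embedding should be most similar to the image embedding from the same post and less similar to the others in the batch. The function names, the temperature value of 0.07, and the choice of InfoNCE itself are assumptions made for this example and may differ from what the authors actually use.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross_entropy_row(logits, target):
    """Negative log-softmax of logits[target], computed in a numerically stable way."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[target]


def symmetric_info_nce(text_emb, image_emb, temperature=0.07):
    """Hypothetical alignment loss: pair k of text_emb and image_emb is the positive,
    every other pairing in the batch is treated as a negative.

    The temperature is an assumed hyperparameter, not a value from the paper.
    """
    assert text_emb and len(text_emb) == len(image_emb)
    t = [l2_normalize(v) for v in text_emb]
    i = [l2_normalize(v) for v in image_emb]
    n = len(t)
    # sim[r][c] = scaled cosine similarity between text r and image c
    sim = [[dot(t[r], i[c]) / temperature for c in range(n)] for r in range(n)]
    # Text-to-image direction: each row should peak on its diagonal entry.
    loss_t2i = sum(cross_entropy_row(sim[r], r) for r in range(n)) / n
    # Image-to-text direction: each column should peak on its diagonal entry.
    loss_i2t = sum(
        cross_entropy_row([sim[r][c] for r in range(n)], c) for c in range(n)
    ) / n
    return 0.5 * (loss_t2i + loss_i2t)


if __name__ == "__main__":
    text = [[1.0, 0.0], [0.0, 1.0]]
    matched_images = [[1.0, 0.0], [0.0, 1.0]]
    swapped_images = [[0.0, 1.0], [1.0, 0.0]]
    print("matched:", symmetric_info_nce(text, matched_images))
    print("swapped:", symmetric_info_nce(text, swapped_images))
```

In a full model, a term like this would typically be added to the sentiment classification loss with a weighting factor, and the embeddings would come from the text and image encoders rather than hand-written vectors; how DTCN combines the two objectives is not stated in the portion of the abstract shown here.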
