Multi-modal Semantic Understanding with Contrastive Cross-modal Feature Alignment
Ming Zhang, Ke Chang, Yunfang Wu

TL;DR
This paper introduces a CLIP-guided contrastive learning architecture for multi-modal semantic understanding, effectively aligning features from different modalities to improve performance on tasks like sarcasm detection and sentiment analysis.
Contribution
The proposed model uses CLIP-guided contrastive learning to align features across modalities, improving multi-modal understanding without relying on task-specific external knowledge.
Findings
Significantly outperforms several baselines on multi-modal sarcasm detection (MMSD) and multi-modal sentiment analysis (MMSA)
Effective cross-modal feature alignment demonstrated
Model is simple to implement and adaptable
Abstract
Multi-modal semantic understanding requires integrating information from different modalities to extract users' real intention behind words. Most previous work applies a dual-encoder structure to separately encode image and text, but fails to learn cross-modal feature alignment, making it hard to achieve cross-modal deep information interaction. This paper proposes a novel CLIP-guided contrastive-learning-based architecture to perform multi-modal feature alignment, which projects the features derived from different modalities into a unified deep space. On multi-modal sarcasm detection (MMSD) and multi-modal sentiment analysis (MMSA) tasks, the experimental results show that our proposed model significantly outperforms several baselines, and our feature alignment strategy brings obvious performance gain over models with different aggregating methods and models even enriched with…
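The abstract describes aligning text and image features in a shared space through contrastive learning, but this page does not give the exact loss. As a rough illustration only, the sketch below uses a CLIP-style symmetric InfoNCE objective over a batch of paired features. The function name `contrastive_alignment_loss`, the temperature value of 0.07, and the plain-list feature representation are all assumptions made for this example, not details taken from the paper.

```python
import math

def l2_normalize(v):
    # Scale a vector to unit length (assumes a non-zero vector).
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def log_softmax(row):
    # Numerically stable log-softmax over one row of logits.
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]

def contrastive_alignment_loss(text_feats, image_feats, temperature=0.07):
    # Hypothetical helper: symmetric InfoNCE over a batch in which
    # text_feats[k] and image_feats[k] form the matching (positive) pair
    # and every other combination in the batch is a negative.
    assert len(text_feats) == len(image_feats)
    n = len(text_feats)
    t = [l2_normalize(v) for v in text_feats]
    im = [l2_normalize(v) for v in image_feats]

    # logits[a][b] = cosine similarity of text a and image b, temperature-scaled.
    logits = [[sum(x * y for x, y in zip(ta, ib)) / temperature for ib in im]
              for ta in t]

    # Text-to-image direction: each row should pick out its own column.
    t2i = -sum(log_softmax(logits[k])[k] for k in range(n)) / n

    # Image-to-text direction: same computation on the transposed matrix.
    cols = [[logits[a][b] for a in range(n)] for b in range(n)]
    i2t = -sum(log_softmax(cols[k])[k] for k in range(n)) / n

    return 0.5 * (t2i + i2t)

# Matching pairs point the same way; mismatched pairs are swapped.
aligned = contrastive_alignment_loss([[1.0, 0.0], [0.0, 1.0]],
                                     [[1.0, 0.0], [0.0, 1.0]])
misaligned = contrastive_alignment_loss([[1.0, 0.0], [0.0, 1.0]],
                                        [[0.0, 1.0], [1.0, 0.0]])
```

In this toy batch the loss is close to zero when each text vector points in the same direction as its paired image vector, and large when the pairs are swapped. That is the behaviour a contrastive alignment objective is meant to reward. In practice the features would come from the text and image encoders rather than hand-written lists.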
Taxonomy
Topics: Natural Language Processing Techniques · Multimodal Machine Learning Applications · Topic Modeling
