Multi-modal Semantic Understanding with Contrastive Cross-modal Feature Alignment

Ming Zhang; Ke Chang; Yunfang Wu

arXiv:2403.06355 · cs.CL · March 12, 2024 · 3 cites

TL;DR

This paper introduces a CLIP-guided contrastive learning architecture for multi-modal semantic understanding, effectively aligning features from different modalities to improve performance on tasks like sarcasm detection and sentiment analysis.

Contribution

The proposed model uses CLIP-guided contrastive learning to align features across modalities, improving multi-modal understanding without relying on task-specific external knowledge.

Findings

01. Significant performance improvements on MMSD and MMSA tasks
02. Effective cross-modal feature alignment demonstrated
03. Model is simple to implement and adaptable

Abstract

Multi-modal semantic understanding requires integrating information from different modalities to extract users' real intention behind words. Most previous work applies a dual-encoder structure to separately encode image and text, but fails to learn cross-modal feature alignment, making it hard to achieve cross-modal deep information interaction. This paper proposes a novel CLIP-guided contrastive-learning-based architecture to perform multi-modal feature alignment, which projects the features derived from different modalities into a unified deep space. On multi-modal sarcasm detection (MMSD) and multi-modal sentiment analysis (MMSA) tasks, the experimental results show that our proposed model significantly outperforms several baselines, and our feature alignment strategy brings obvious performance gain over models with different aggregating methods and models even enriched with…
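
To make the idea of contrastive cross-modal alignment concrete, here is a minimal sketch of a symmetric InfoNCE loss of the kind popularized by CLIP. It is not the authors' implementation (see the official repository for that). The function names, the toy feature vectors, and the temperature of 0.07 are all assumptions chosen for illustration, and the sketch uses only the Python standard library rather than PyTorch.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def log_softmax(row):
    """Numerically stable log-softmax over a list of scores."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def contrastive_alignment_loss(text_feats, image_feats, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired (text, image) features.

    Pair k (text_feats[k], image_feats[k]) is the positive; every other
    combination in the batch is treated as a negative.
    The temperature value is an assumption, not taken from the paper.
    """
    t = [l2_normalize(v) for v in text_feats]
    i = [l2_normalize(v) for v in image_feats]
    n = len(t)
    # sim[a][b]: temperature-scaled cosine similarity of text a and image b
    sim = [
        [sum(x * y for x, y in zip(t[a], i[b])) / temperature for b in range(n)]
        for a in range(n)
    ]
    # Text-to-image direction: row a should assign most mass to column a.
    t2i = -sum(log_softmax(sim[a])[a] for a in range(n)) / n
    # Image-to-text direction: column b should assign most mass to row b.
    cols = [[sim[a][b] for a in range(n)] for b in range(n)]
    i2t = -sum(log_softmax(cols[b])[b] for b in range(n)) / n
    return 0.5 * (t2i + i2t)


# Toy batch: two text vectors and two image vectors (hypothetical values).
texts = [[1.0, 0.0], [0.0, 1.0]]
aligned = contrastive_alignment_loss(texts, [[2.0, 0.1], [0.1, 2.0]])
shuffled = contrastive_alignment_loss(texts, [[0.1, 2.0], [2.0, 0.1]])
```

In this toy batch, the loss is low when each text vector points in the same direction as its paired image vector and high when the pairs are swapped, which is the pressure that pulls matching text and image features together in a shared space.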

Code & Models

Repositories

changke123/clfa (PyTorch, official)

Taxonomy

Topics: Natural Language Processing Techniques · Multimodal Machine Learning Applications · Topic Modeling