Self-MI: Efficient Multimodal Fusion via Self-Supervised Multi-Task Learning with Auxiliary Mutual Information Maximization

Cam-Van Thi Nguyen; Ngoc-Hoa Thi Nguyen; Duc-Trong Le; Quang-Thuy Ha

arXiv:2311.03785·cs.CV·November 8, 2023·1 cites

Open Access

TL;DR

Self-MI introduces a self-supervised multimodal fusion method that maximizes mutual information between unimodal inputs and fused representations, improving performance on benchmark datasets.

Contribution

It proposes a novel self-supervised learning framework with auxiliary MI maximization and a label generation module for better multimodal fusion.

Findings

1. Enhanced fusion performance on CMU-MOSI, CMU-MOSEI, and SIMS datasets.

2. Effective mutual information maximization improves modality alignment.

3. Outperforms existing multimodal fusion methods.

Abstract

Multimodal representation learning poses significant challenges in capturing informative and distinct features from multiple modalities. Existing methods often struggle to exploit the unique characteristics of each modality due to unified multimodal annotations. In this study, we propose Self-MI, a self-supervised learning approach that also leverages Contrastive Predictive Coding (CPC) as an auxiliary technique to maximize the Mutual Information (MI) between unimodal input pairs and between the multimodal fusion result and the unimodal inputs. Moreover, we design a label generation module, $ULG_{MI}$ for short, that enables us to create meaningful and informative labels for each modality in a self-supervised manner. By maximizing the Mutual Information, we encourage better alignment between the multimodal fusion and the individual modalities, facilitating improved multimodal fusion. Extensive…
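To make the CPC-style objective mentioned in the abstract more concrete, here is a minimal NumPy sketch of an InfoNCE loss between a batch of fused representations and a batch of unimodal representations. It is an illustration under stated assumptions, not the authors' implementation: the function name `info_nce`, the cosine-similarity critic, the temperature value, and the random toy embeddings are all hypothetical choices, and the paper may use a different critic or predictor network. Minimizing this loss tightens the standard CPC lower bound on mutual information, I >= log(N) - loss, where N is the batch size.

```python
import numpy as np

def info_nce(fused, unimodal, temperature=0.1):
    """InfoNCE loss for a batch of paired embeddings.

    Row i of `fused` and row i of `unimodal` form the positive pair;
    every other row in the batch serves as a negative.
    The cosine critic and temperature are assumptions for this sketch.
    """
    # L2-normalise so that dot products are cosine similarities.
    f = fused / np.linalg.norm(fused, axis=1, keepdims=True)
    u = unimodal / np.linalg.norm(unimodal, axis=1, keepdims=True)

    # Pairwise similarity scores, shape (N, N); the diagonal holds positives.
    logits = (f @ u.T) / temperature

    # Numerically stable row-wise log-softmax.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    # Average negative log-probability assigned to the positive pairs.
    return float(-np.mean(np.diag(log_probs)))

# Toy data standing in for encoder outputs (hypothetical, not from the paper).
rng = np.random.default_rng(0)
text = rng.normal(size=(8, 16))                          # unimodal embeddings
fused_aligned = text + 0.01 * rng.normal(size=(8, 16))   # fusion that tracks the modality
fused_random = rng.normal(size=(8, 16))                  # fusion unrelated to the modality

loss_aligned = info_nce(fused_aligned, text)
loss_random = info_nce(fused_random, text)
print("aligned:", loss_aligned, "random:", loss_random)
```

In this toy setup the aligned fusion yields a much lower loss than the unrelated one, which is the behaviour an auxiliary MI term would reward during training. In a real model the same loss would be computed on encoder outputs and added to the task loss with a weighting coefficient.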

Taxonomy

Topics: Text and Document Classification Technologies · Domain Adaptation and Few-Shot Learning · Speech Recognition and Synthesis

Methods: InfoNCE · Contrastive Predictive Coding