Set-CLIP: Exploring Aligned Semantic From Low-Alignment Multimodal Data Through A Distribution View

Zijia Song; Zelin Zang; Yelin Wang; Guozheng Yang; Kaicheng Yu; Wanyu Chen; Miaoyu Wang; Stan Z. Li

arXiv:2406.05766 · cs.LG · September 24, 2024 · 1 citation

Open Access

TL;DR

Set-CLIP introduces a novel semi-supervised multimodal alignment method that leverages distribution constraints and implicit semantic extraction to perform well even with minimal paired data, outperforming existing models significantly.

Contribution

The paper proposes Set-CLIP, a new approach that reframes multimodal alignment as a manifold matching problem using distribution constraints, enabling effective learning from low-alignment data.

Findings

01. Set-CLIP achieves a 144.83% improvement over CLIP in the setting without paired data.

02. The method extracts implicit semantic alignment from unpaired multimodal data.

03. Set-CLIP performs well across diverse tasks such as protein analysis and remote sensing.

Abstract

Multimodal fusion breaks through the boundaries between diverse modalities and has already achieved notable performance. However, in many specialized fields it is difficult to obtain sufficient aligned data for training, which seriously limits the use of previously effective models. Semi-supervised learning approaches have therefore been used to facilitate multimodal alignment by learning from low-alignment data with fewer matched pairs, but traditional techniques such as pseudo-labeling may run into trouble in label-deficient scenarios. To tackle these challenges, we reframe semi-supervised multimodal alignment as a manifold matching problem and propose a new methodology based on CLIP, termed Set-CLIP. Specifically, by designing a novel semantic density distribution loss, we constrain the latent representation distribution with fine granularity and extract implicit semantic…
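
The general idea described in the abstract, a CLIP-style contrastive term on the few matched pairs combined with a distribution constraint on unpaired embeddings, can be illustrated with a minimal sketch. The abstract is truncated and does not define the semantic density distribution loss, so this example swaps in a simple RBF-kernel maximum mean discrepancy (MMD) as a stand-in distribution term. All function names, the temperature, the bandwidth, and the weighting are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    # Project each embedding onto the unit sphere, as CLIP does before comparing.
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)

def clip_contrastive_loss(a, b, temperature=0.07):
    # Symmetric InfoNCE loss: row i of `a` is the matched partner of row i of `b`.
    a, b = l2_normalize(a), l2_normalize(b)
    logits = a @ b.T / temperature

    def xent_diag(l):
        # Cross-entropy where the correct class for row i is column i.
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_prob = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))

    return 0.5 * (xent_diag(logits) + xent_diag(logits.T))

def rbf_mmd2(x, y, bandwidth=1.0):
    # Biased squared MMD with an RBF kernel between two UNPAIRED sets.
    # Stand-in for a distribution constraint; NOT the paper's SDD loss.
    def k(u, v):
        d2 = ((u[:, None, :] - v[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * bandwidth ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

def semi_supervised_loss(paired_a, paired_b, unpaired_a, unpaired_b, weight=1.0):
    # Hypothetical combination: supervised contrastive term on matched pairs
    # plus a weighted distribution-matching term on unmatched embeddings.
    contrastive = clip_contrastive_loss(paired_a, paired_b)
    distribution = rbf_mmd2(l2_normalize(unpaired_a), l2_normalize(unpaired_b))
    return contrastive + weight * distribution
```

In this sketch the distribution term needs no correspondence between rows of `unpaired_a` and `unpaired_b`, which is what lets it use data that has no matched pairs; the two sets may even differ in size.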

Taxonomy

Topics: Semantic Web and Ontologies

Methods: Contrastive Language-Image Pre-training