Set-CLIP: Exploring Aligned Semantic From Low-Alignment Multimodal Data Through A Distribution View
Zijia Song, Zelin Zang, Yelin Wang, Guozheng Yang, Kaicheng Yu, Wanyu Chen, Miaoyu Wang, Stan Z. Li

TL;DR
Set-CLIP introduces a novel semi-supervised multimodal alignment method that leverages distribution constraints and implicit semantic extraction to perform well even with minimal paired data, outperforming existing models significantly.
Contribution
The paper proposes Set-CLIP, a new approach that reframes multimodal alignment as a manifold matching problem using distribution constraints, enabling effective learning from low-alignment data.
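To make the idea of a distribution constraint concrete, the sketch below combines a CLIP-style contrastive loss on the few matched pairs with a penalty that pulls the two embedding distributions of the unpaired data together. It is not the paper's semantic density distribution loss, whose formula is not given here. As a stand-in it uses a Gaussian-kernel maximum mean discrepancy (MMD), a standard way to compare two samples without pairing them. The function names, the `bandwidth`, `temperature`, and `weight` parameters, and the simple weighted sum are all assumptions made for illustration.

```python
import numpy as np

def rbf_kernel(a, b, bandwidth):
    # Pairwise squared Euclidean distances between rows of a and rows of b.
    sq = (a ** 2).sum(1)[:, None] + (b ** 2).sum(1)[None, :] - 2.0 * a @ b.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * bandwidth ** 2))

def mmd2(x, y, bandwidth=1.0):
    # Biased estimate of squared MMD between two unpaired samples.
    # Stand-in distribution constraint, NOT the paper's loss.
    return (rbf_kernel(x, x, bandwidth).mean()
            + rbf_kernel(y, y, bandwidth).mean()
            - 2.0 * rbf_kernel(x, y, bandwidth).mean())

def clip_loss(emb_a, emb_b, temperature=0.07):
    # Symmetric InfoNCE loss: row i of emb_a is matched with row i of emb_b.
    emb_a = emb_a / np.linalg.norm(emb_a, axis=1, keepdims=True)
    emb_b = emb_b / np.linalg.norm(emb_b, axis=1, keepdims=True)
    logits = emb_a @ emb_b.T / temperature

    def cross_entropy_on_diagonal(l):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_prob = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))

    return 0.5 * (cross_entropy_on_diagonal(logits)
                  + cross_entropy_on_diagonal(logits.T))

def combined_objective(paired_a, paired_b, unpaired_a, unpaired_b,
                       weight=1.0, bandwidth=1.0):
    # Hypothetical combination: supervised term on matched pairs plus a
    # distribution-matching term on unpaired embeddings.
    return (clip_loss(paired_a, paired_b)
            + weight * mmd2(unpaired_a, unpaired_b, bandwidth))

# Example call with random embeddings (8 matched pairs, 64 unpaired per modality).
rng = np.random.default_rng(0)
loss = combined_objective(rng.normal(size=(8, 16)), rng.normal(size=(8, 16)),
                          rng.normal(size=(64, 16)), rng.normal(size=(64, 16)),
                          bandwidth=4.0)
```

The contrastive term needs row-aligned pairs, while the MMD term only compares the two sets as wholes, which is why a term of this kind can still be computed when few or no matched pairs exist.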
Findings
Set-CLIP achieves a 144.83% improvement over CLIP when no paired data is available.
The method effectively extracts implicit semantic alignment from unpaired multimodal data.
Set-CLIP performs well across diverse tasks like protein analysis and remote sensing.
Abstract
Multimodal fusion breaks through the boundaries between diverse modalities and has already achieved notable performances. However, in many specialized fields, it is struggling to obtain sufficient alignment data for training, which seriously limits the use of previously effective models. Therefore, semi-supervised learning approaches are attempted to facilitate multimodal alignment by learning from low-alignment data with fewer matched pairs, but traditional techniques like pseudo-labeling may run into troubles in the label-deficient scenarios. To tackle these challenges, we reframe semi-supervised multimodal alignment as a manifold matching issue and propose a new methodology based on CLIP, termed Set-CLIP. Specifically, by designing a novel semantic density distribution loss, we constrain the latent representation distribution with fine granularity and extract implicit semantic…
Taxonomy
Topics: Semantic Web and Ontologies
Methods: Contrastive Language-Image Pre-training
