Aligning the True Semantics: Constrained Decoupling and Distribution Sampling for Cross-Modal Alignment
Xiang Ma, Lexin Fang, Litian Xu, Caiming Zhang

TL;DR
This paper introduces CDDS, a novel cross-modal alignment method that decouples semantic and modality information in embeddings, using a constrained dual-path UNet and distribution sampling to improve semantic consistency between vision and language.
Contribution
The paper proposes a new algorithm, CDDS, that effectively separates semantic from modality information and bridges modality gaps, advancing cross-modal alignment techniques.
Findings
Outperforms state-of-the-art methods by 6.6% to 14.2% on various benchmarks.
Uses a dual-path UNet for adaptive embedding decoupling.
Employs distribution sampling to address modality gaps.
Abstract
Cross-modal alignment is a crucial task in multimodal learning aimed at achieving semantic consistency between vision and language. This requires that image-text pairs exhibit similar semantics. Traditional algorithms pursue embedding consistency to achieve semantic consistency, ignoring the non-semantic information present in the embedding. An intuitive approach is to decouple the embeddings into semantic and modality components, aligning only the semantic component. However, this introduces two main challenges: (1) There is no established standard for distinguishing semantic and modal information. (2) The modality gap can cause semantic alignment deviation or information loss. To align the true semantics, we propose a novel cross-modal alignment algorithm via Constrained Decoupling and Distribution Sampling (CDDS). Specifically, (1) A dual-path UNet…
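To make the "align only the semantic component" idea concrete, here is a minimal sketch in plain Python. It is not the authors' method: the paper learns the decoupling with a constrained dual-path UNet, whereas this sketch assumes a fixed split in which the first `semantic_dim` entries of each embedding are treated as semantic and the rest as modality-specific. The loss is a standard symmetric InfoNCE contrastive loss, chosen here for illustration only, and all function names and the `temperature` value are hypothetical.

```python
import math


def split_embedding(vec, semantic_dim):
    """Hypothetical fixed split (an assumption, not the paper's learned decoupling):
    the first `semantic_dim` entries are 'semantic', the rest 'modality'."""
    return vec[:semantic_dim], vec[semantic_dim:]


def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v))) over a list of floats."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def semantic_alignment_loss(image_embs, text_embs, semantic_dim, temperature=0.1):
    """Symmetric InfoNCE computed on the semantic components only.

    image_embs[k] and text_embs[k] are assumed to be a matched pair.
    The modality components are discarded before any similarity is computed.
    """
    img_sem = [split_embedding(v, semantic_dim)[0] for v in image_embs]
    txt_sem = [split_embedding(v, semantic_dim)[0] for v in text_embs]
    n = len(img_sem)
    sims = [[cosine(i, t) / temperature for t in txt_sem] for i in img_sem]
    loss = 0.0
    for k in range(n):
        row = sims[k]                          # image k against all texts
        col = [sims[j][k] for j in range(n)]   # text k against all images
        loss += log_sum_exp(row) - row[k]
        loss += log_sum_exp(col) - col[k]
    return loss / (2 * n)


# Toy pairs: the semantic parts match, the modality parts differ strongly.
image_embs = [[1.0, 0.0, 5.0], [0.0, 1.0, 5.0]]
text_embs = [[1.0, 0.0, -5.0], [0.0, 1.0, -5.0]]

aligned = semantic_alignment_loss(image_embs, text_embs, semantic_dim=2)
swapped = semantic_alignment_loss(image_embs, text_embs[::-1], semantic_dim=2)
print(aligned < swapped)
```

Because the modality entries never enter the similarity computation, changing them leaves the loss unchanged. That is the property the abstract is after, although in the paper the boundary between the two components is learned rather than fixed by index.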
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Generative Adversarial Networks and Image Synthesis
