Aligning the True Semantics: Constrained Decoupling and Distribution Sampling for Cross-Modal Alignment

Xiang Ma; Lexin Fang; Litian Xu; Caiming Zhang

arXiv:2603.05566·cs.LG·March 9, 2026

TL;DR

The paper introduces CDDS, a cross-modal alignment method that decouples embeddings into semantic and modality components, using a constrained dual-path UNet and distribution sampling to improve semantic consistency between vision and language.

Contribution

The paper proposes CDDS, an algorithm that separates semantic information from modality information in embeddings and bridges the modality gap between vision and language.

Findings

01 Outperforms state-of-the-art methods by 6.6% to 14.2% on various benchmarks.

02 Uses a dual-path UNet for adaptive embedding decoupling.

03 Employs distribution sampling to address modality gaps.

Abstract

Cross-modal alignment is a crucial task in multimodal learning aimed at achieving semantic consistency between vision and language. This requires that image-text pairs exhibit similar semantics. Traditional algorithms pursue embedding consistency to achieve semantic consistency, ignoring the non-semantic information present in the embedding. An intuitive approach is to decouple the embeddings into semantic and modality components, aligning only the semantic component. However, this introduces two main challenges: (1) There is no established standard for distinguishing semantic and modal information. (2) The modality gap can cause semantic alignment deviation or information loss. To align the true semantics, we propose a novel cross-modal alignment algorithm via Constrained Decoupling and Distribution Sampling (CDDS). Specifically, (1) A dual-path UNet…
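
The "intuitive approach" mentioned in the abstract, aligning only the semantic component of each embedding, can be illustrated with a toy sketch. This is not the CDDS method: the fixed slice used as the "semantic" part, the function names, the symmetric contrastive loss, and the temperature value are all assumptions made for illustration, and the paper's dual-path UNet and distribution sampling are not reproduced here.

```python
import math

def split_embedding(embedding, semantic_dim):
    # Hypothetical decoupling: treat the first `semantic_dim` entries as the
    # semantic component and the rest as the modality component.
    # (The paper learns this split; a fixed slice is only a stand-in.)
    return embedding[:semantic_dim], embedding[semantic_dim:]

def cosine(u, v):
    # Cosine similarity between two non-zero vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def _diagonal_cross_entropy(logits):
    # Mean cross-entropy where the correct class for row k is column k.
    total = 0.0
    for k, row in enumerate(logits):
        row_max = max(row)
        log_z = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_z - row[k]
    return total / len(logits)

def semantic_alignment_loss(image_embs, text_embs, semantic_dim, temperature=0.1):
    # Symmetric contrastive loss computed on the semantic components only;
    # the modality components are ignored by construction.
    img_sem = [split_embedding(e, semantic_dim)[0] for e in image_embs]
    txt_sem = [split_embedding(e, semantic_dim)[0] for e in text_embs]
    logits = [[cosine(i, t) / temperature for t in txt_sem] for i in img_sem]
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (_diagonal_cross_entropy(logits) + _diagonal_cross_entropy(transposed))

# Toy batch: pair k shares semantic content, but modality parts differ.
images = [[1.0, 0.0, 5.0, 5.0], [0.0, 1.0, 5.0, 5.0]]
texts = [[1.0, 0.0, -3.0, 2.0], [0.0, 1.0, -3.0, 2.0]]
aligned_loss = semantic_alignment_loss(images, texts, semantic_dim=2)
shuffled_loss = semantic_alignment_loss(images, texts[::-1], semantic_dim=2)
```

In this toy batch the correctly paired loss is lower than the loss for shuffled pairs, and changing only the trailing modality entries leaves the loss unchanged, which is the property that aligning the semantic component alone is meant to provide.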

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Generative Adversarial Networks and Image Synthesis