Distributional Vision-Language Alignment by Cauchy-Schwarz Divergence

Wenzhe Yin; Zehao Xiao; Pan Zhou; Shujian Yu; Jiayi Shen; Jan-Jakob Sonke; Efstratios Gavves

arXiv:2502.17028·cs.LG·February 25, 2026

Open Access · 3 Reviews

TL;DR

This paper introduces CS-Aligner, a framework that uses Cauchy-Schwarz divergence for distributional vision-language alignment. Unlike purely pairwise methods, it captures both the global distribution of each modality and pairwise semantic relationships, which improves alignment and allows more flexible training data.

Contribution

The paper proposes CS-Aligner, a new approach combining Cauchy-Schwarz divergence with mutual information for enhanced distributional vision-language alignment, addressing limitations of previous pairwise methods.
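For reference, the standard textbook definition of the Cauchy-Schwarz divergence between two densities is given below. The symbols p and q are placeholders introduced here (for example, the densities of image and text embeddings); the exact form used in the paper may differ in notation.

```latex
D_{\mathrm{CS}}(p;\,q)
  = -\log \frac{\left(\int p(x)\,q(x)\,dx\right)^{2}}
               {\int p(x)^{2}\,dx \,\int q(x)^{2}\,dx}
```

By the Cauchy-Schwarz inequality the ratio inside the logarithm is at most 1, so the divergence is non-negative, and it is zero only when p and q coincide. It is also symmetric in p and q.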

Findings

01. Improved alignment in text-to-image generation and retrieval tasks.

02. CS divergence complements InfoNCE, resolving alignment-uniformity conflicts.

03. Enables incorporation of unpaired data and token-level information.

Abstract

Vision-language alignment is crucial for various downstream tasks such as cross-modal generation and retrieval. Previous multimodal approaches like CLIP utilize InfoNCE to maximize mutual information, primarily aligning pairwise samples across modalities while overlooking distributional differences. In addition, InfoNCE has an inherent conflict between alignment and uniformity in multimodal settings, leading to suboptimal alignment with modality gaps. To overcome these limitations, we propose CS-Aligner, a novel framework that performs distributional vision-language alignment by integrating Cauchy-Schwarz (CS) divergence with mutual information. CS-Aligner captures both the global distribution information of each modality and the pairwise semantic relationships. We find that the CS divergence seamlessly addresses InfoNCE's alignment-uniformity conflict and serves complementary roles with…
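The combination described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: it uses the standard Gaussian-kernel plug-in estimator of the CS divergence, log mean(K_xx) + log mean(K_yy) - 2 log mean(K_xy), together with a symmetric InfoNCE loss. The function names, the bandwidth `sigma`, the temperature `tau`, the weight `lam`, and the simple additive combination are all assumptions made for illustration.

```python
import numpy as np

def gaussian_gram(a, b, sigma):
    # Pairwise Gaussian kernel values between the rows of a and the rows of b.
    sq = np.sum(a**2, axis=1)[:, None] + np.sum(b**2, axis=1)[None, :] - 2.0 * a @ b.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))

def cs_divergence(x, y, sigma=1.0):
    # Plug-in estimate: log mean(Kxx) + log mean(Kyy) - 2 log mean(Kxy).
    # It compares the two sample sets as distributions, so x and y need not be paired.
    kxx = gaussian_gram(x, x, sigma).mean()
    kyy = gaussian_gram(y, y, sigma).mean()
    kxy = gaussian_gram(x, y, sigma).mean()
    return np.log(kxx) + np.log(kyy) - 2.0 * np.log(kxy)

def log_softmax(z, axis):
    # Numerically stable log-softmax along the given axis.
    z = z - z.max(axis=axis, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=axis, keepdims=True))

def info_nce(x, y, tau=0.07):
    # Symmetric InfoNCE: row i of x is the positive for row i of y.
    logits = x @ y.T / tau
    idx = np.arange(len(x))
    loss_xy = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_yx = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_xy + loss_yx)

def l2_normalize(v):
    return v / np.linalg.norm(v, axis=1, keepdims=True)

# Toy embeddings (hypothetical data): one well-aligned text set and one
# shifted text set that mimics a modality gap.
rng = np.random.default_rng(0)
img = l2_normalize(rng.normal(size=(64, 16)))
txt_near = l2_normalize(img + 0.05 * rng.normal(size=img.shape))
txt_shift = l2_normalize(img + 2.0)

lam = 1.0  # assumed weight on the distributional term
for name, txt in [("near", txt_near), ("shifted", txt_shift)]:
    total = info_nce(img, txt) + lam * cs_divergence(img, txt)
    print(name, float(cs_divergence(img, txt)), float(total))
```

Because `cs_divergence` only looks at the two sets of embeddings as a whole, it can be evaluated on unpaired batches, whereas `info_nce` requires row-wise pairs. The estimate depends on the kernel bandwidth, so `sigma` would need tuning in practice.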

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 10 · Confidence 5

Strengths

The paper is described in detail. The proposed method is mathematically solid. The paper gives important insight into the InfoNCE loss and proposes an important method that could have a great impact on the multimodal research community. The experimental results in both the main body and the supplementary material are convincing.

Weaknesses

The proposed method shows some parameter sensitivity, as seen in Table 7 of the supplemental material. I do not think the method is parameter-independent, as the authors claim.
- Could you show some sample images, in addition to FID, to demonstrate that the method is robust to parameter changes?
- Is there any method that can automatically find optimal parameter settings rather than relying on users’ empirical optimization?
Some minor modification proposals (no need to reply):
- A missing period after

Reviewer 02 · Rating 6 · Confidence 4

Strengths

1. The authors integrate Cauchy-Schwarz divergence with mutual information (InfoNCE) and provide a theoretical analysis of how it alleviates the alignment–uniformity conflict, addressing limitations in InfoNCE-based alignment.
2. Leveraging distribution-level alignment, the method naturally supports unpaired data and multi-caption supervision, offering greater flexibility and scalability for real-world applications.
3. The use of lightweight adapters (e.g., Adapter, LoRA) enables efficient align

Weaknesses

1. Although the method is well-motivated, the theoretical analysis of convergence, generalization, or stability of CS-Aligner remains shallow and largely empirical.
2. The KDE-based CS divergence may introduce additional computational overhead and sensitivity to kernel bandwidth selection, which is not systematically analyzed.
3. Some comparisons (e.g., Eclipse, LLM2CLIP) reuse decoders from other works, which may not fully isolate alignment quality from downstream generative components.
4. The

Reviewer 03 · Rating 4 · Confidence 3

Strengths

1. Clearly identifies InfoNCE’s limitations: its conflation of alignment and uniformity objectives and its dependence on strong negatives that weaken cross-modal consistency.
2. Introduces CS divergence as a novel, theoretically grounded approach for distributional alignment, effectively addressing the shortcomings of sample-level contrastive methods.

Weaknesses

1. The experiments focus mainly on t2i and i2t tasks, which is not sufficient to verify the method's effectiveness.
2. The ablation studies in Appendix H.1 are insufficient and unconvincing. The experimental settings are not clearly defined. Furthermore, the paper fails to adequately explain the core concepts of InfoNCE and CS-Aligner, and does not justify the significant performance degradation observed when using only InfoNCE. While these experiments are in the appendix, it is essential to tr

Taxonomy

Topics: Optics and Image Analysis

Methods: InfoNCE · Contrastive Language-Image Pre-training