Distributional Vision-Language Alignment by Cauchy-Schwarz Divergence
Wenzhe Yin, Zehao Xiao, Pan Zhou, Shujian Yu, Jiayi Shen, Jan-Jakob Sonke, Efstratios Gavves

TL;DR
This paper introduces CS-Aligner, a framework that uses Cauchy-Schwarz divergence for distributional vision-language alignment. It improves over existing methods by capturing both global distributional information and pairwise semantic relationships, which yields better alignment and greater flexibility.
Contribution
The paper proposes CS-Aligner, a new approach combining Cauchy-Schwarz divergence with mutual information for enhanced distributional vision-language alignment, addressing limitations of previous pairwise methods.
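For orientation, the Cauchy-Schwarz divergence named above is usually defined as below, together with its common kernel-density plug-in estimate. This is the standard form from the wider literature and not a reproduction of the paper's own equations; the Gaussian kernel \kappa_\sigma, the bandwidth \sigma, and the sample sizes N and M are notation assumed here for illustration.

```latex
% Cauchy-Schwarz divergence between densities p and q
% (non-negative, and zero if and only if p = q)
D_{\mathrm{CS}}(p; q)
  = -\log \frac{\left( \int p(x)\, q(x)\, dx \right)^{2}}
               {\int p(x)^{2}\, dx \, \int q(x)^{2}\, dx}

% Plug-in estimate from samples x_1..x_N ~ p and y_1..y_M ~ q,
% using a Gaussian kernel \kappa_\sigma with bandwidth \sigma
\widehat{D}_{\mathrm{CS}}
  = \log\!\Big( \tfrac{1}{N^{2}} \sum_{i=1}^{N}\sum_{j=1}^{N} \kappa_\sigma(x_i, x_j) \Big)
  + \log\!\Big( \tfrac{1}{M^{2}} \sum_{i=1}^{M}\sum_{j=1}^{M} \kappa_\sigma(y_i, y_j) \Big)
  - 2 \log\!\Big( \tfrac{1}{NM} \sum_{i=1}^{N}\sum_{j=1}^{M} \kappa_\sigma(x_i, y_j) \Big)
```

The estimate depends only on the two sample sets, not on any pairing between them, and N need not equal M. That is consistent with the summary's claim that the approach can use unpaired data.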
Findings
Improved alignment in text-to-image generation and retrieval tasks.
CS divergence complements InfoNCE, resolving alignment-uniformity conflicts.
Enables incorporation of unpaired data and token-level information.
Abstract
Vision-language alignment is crucial for various downstream tasks such as cross-modal generation and retrieval. Previous multimodal approaches like CLIP utilize InfoNCE to maximize mutual information, primarily aligning pairwise samples across modalities while overlooking distributional differences. In addition, InfoNCE has an inherent conflict between alignment and uniformity in the multimodal setting, leading to suboptimal alignment with modality gaps. To overcome these limitations, we propose CS-Aligner, a novel framework that performs distributional vision-language alignment by integrating Cauchy-Schwarz (CS) divergence with mutual information. CS-Aligner captures both the global distribution information of each modality and the pairwise semantic relationships. We find that the CS divergence seamlessly addresses InfoNCE's alignment-uniformity conflict and serves complementary roles with…
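The abstract's idea of pairing a distribution-level term with InfoNCE can be sketched roughly as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: the function names, the Gaussian kernel, the bandwidth `sigma`, the weight `lam`, the temperature, and the simple additive combination are all hypothetical choices made here, and real training would need an autodiff framework.

```python
import numpy as np

def l2_normalize(x):
    # Project each embedding onto the unit sphere, as is common for CLIP-style features.
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def gaussian_gram(a, b, sigma):
    # Pairwise Gaussian kernel values between rows of a and rows of b.
    sq = (a ** 2).sum(1)[:, None] + (b ** 2).sum(1)[None, :] - 2.0 * a @ b.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma ** 2))

def cs_divergence(x, y, sigma=1.0):
    # Kernel plug-in estimate of the Cauchy-Schwarz divergence.
    # x and y may have different numbers of rows; no pairing is used.
    kxx = gaussian_gram(x, x, sigma).mean()
    kyy = gaussian_gram(y, y, sigma).mean()
    kxy = gaussian_gram(x, y, sigma).mean()
    return np.log(kxx) + np.log(kyy) - 2.0 * np.log(kxy)

def info_nce(img, txt, temperature=0.07):
    # Symmetric InfoNCE: row i of img is the positive for row i of txt.
    logits = img @ txt.T / temperature

    def cross_entropy_diag(l):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_prob = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))

    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits.T))

def combined_loss(img, txt, lam=1.0, sigma=1.0, temperature=0.07):
    # Hypothetical combination: pairwise term plus weighted distributional term.
    return info_nce(img, txt, temperature) + lam * cs_divergence(img, txt, sigma)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    img = l2_normalize(rng.normal(size=(32, 8)))
    txt = l2_normalize(rng.normal(size=(32, 8)) + 3.0)  # shifted "modality"
    print("CS divergence:", cs_divergence(img, txt))
    print("combined loss:", combined_loss(img, txt))
```

In this sketch, `cs_divergence` only compares the two sets of embeddings as distributions, while `info_nce` relies on row-wise pairs. Unpaired samples could therefore contribute to the first term alone.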
Peer Reviews
Decision: ICLR 2026 Poster
The paper is described in detail. The proposed method is mathematically solid. This paper gives important insight into the InfoNCE loss and proposes an important method that would have a great impact on the multimodal research community. Experimental results both in the main body and supplementary are convincing enough.
The proposed method has some parameter sensitivity as shown in Table 7 in the supplemental material. I do not think the method is parameter-independent as the authors claim. - Could you show some sample images to show the method is robust to parameter changes, in addition to showing FID? - Is there any method that can automatically find optimal parameter settings rather than relying on users’ empirical optimization? Some minor modification proposals (no need to reply) - A missing period after
1. The authors integrate Cauchy-Schwarz divergence with mutual information (InfoNCE) and provide a theoretical analysis of how it alleviates the alignment–uniformity conflict, addressing limitations in InfoNCE-based alignment. 2. Leveraging distribution-level alignment, the method naturally supports unpaired data and multi-caption supervision, offering greater flexibility and scalability for real-world applications. 3. The use of lightweight adapters (e.g., Adapter, LoRA) enables efficient align
1. Although the method is well-motivated, the theoretical analysis of convergence, generalization, or stability of CS-Aligner remains shallow and largely empirical. 2. The KDE-based CS divergence may introduce additional computational overhead and sensitivity to kernel bandwidth selection, which is not systematically analyzed. 3. Some comparisons (e.g., Eclipse, LLM2CLIP) reuse decoders from other works, which may not fully isolate alignment quality from downstream generative components. 4. The
1. Clearly identifies InfoNCE’s limitations—its conflation of alignment and uniformity objectives and dependence on strong negatives that weaken cross-modal consistency. 2. Introduces CS divergence as a novel, theoretically grounded approach for distributional alignment, effectively addressing the shortcomings of sample-level contrastive methods.
1. The experiments are mainly focused on t2i and i2t tasks, which are not sufficient to verify the effectiveness. 2. The ablation studies in Appendix H.1 are insufficient and unconvincing. The experimental settings are not clearly defined. Furthermore, the paper fails to adequately explain the core concepts of InfoNCE and CS-Aligner, and does not justify the significant performance degradation observed when using only InfoNCE. While these experiments are in the appendix, it is essential to tr
Taxonomy
Topics: Optics and Image Analysis
Methods: InfoNCE · Contrastive Language-Image Pre-training
