Probabilistic Variational Contrastive Learning
Minoh Jeong, Seonho Kim, Alfred Hero

TL;DR
This paper introduces Variational Contrastive Learning (VCL), a probabilistic framework that enhances contrastive learning by providing uncertainty quantification and mitigating issues like dimensional collapse, while maintaining high classification performance.
Contribution
VCL is a novel decoder-free probabilistic contrastive learning method that models embeddings as distributions, enabling uncertainty estimation and improved robustness.
Findings
VCL reduces dimensional collapse in embeddings.
VCL improves mutual information with class labels.
VCL matches or surpasses deterministic baselines in classification accuracy.
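Dimensional collapse, named in the first finding, is often diagnosed from the spectrum of the embedding covariance. The sketch below computes an entropy-based effective rank as one such diagnostic. The function name `effective_rank` and the synthetic data are illustrative assumptions, and nothing here is confirmed to be the metric used in the paper.

```python
import numpy as np


def effective_rank(embeddings: np.ndarray, eps: float = 1e-12) -> float:
    """Entropy-based effective rank of the embedding spectrum (hypothetical helper)."""
    # Center the embeddings so the spectrum reflects covariance, not the mean.
    centered = embeddings - embeddings.mean(axis=0, keepdims=True)
    # Squared singular values are proportional to the covariance eigenvalues.
    s = np.linalg.svd(centered, compute_uv=False)
    p = s**2 / (np.sum(s**2) + eps)
    entropy = -np.sum(p * np.log(p + eps))
    return float(np.exp(entropy))


rng = np.random.default_rng(0)
# Embeddings that spread over all 16 dimensions.
full = rng.normal(size=(512, 16))
# Synthetic "collapsed" embeddings: 16-d vectors confined to a 2-d subspace.
collapsed = rng.normal(size=(512, 2)) @ rng.normal(size=(2, 16))

print(effective_rank(full), effective_rank(collapsed))
```

An effective rank close to the embedding dimension suggests the representation uses all directions, while a value far below it suggests collapse onto a low-dimensional subspace.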
Abstract
Deterministic embeddings learned by contrastive learning (CL) methods such as SimCLR and SupCon achieve state-of-the-art performance but lack a principled mechanism for uncertainty quantification. We propose Variational Contrastive Learning (VCL), a decoder-free framework that maximizes the evidence lower bound (ELBO) by interpreting the InfoNCE loss as a surrogate reconstruction term and adding a KL divergence regularizer to a uniform prior on the unit hypersphere. We model the approximate posterior as a projected normal distribution, enabling the sampling of probabilistic embeddings. Our two instantiations, VSimCLR and VSupCon, replace deterministic embeddings with samples from the approximate posterior and incorporate a normalized KL term into the loss. Experiments on multiple benchmarks demonstrate that VCL mitigates dimensional collapse, enhances mutual information with class…
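The objective described in the abstract can be sketched in NumPy as below, under several assumptions. The encoder outputs (`mu`, `log_var`), the weight `beta`, the temperature, and all function names are hypothetical. The contrastive term is a simplified cross-view InfoNCE rather than the full SimCLR loss. The KL term is the closed-form Gaussian KL to N(0, I) divided by the dimension, used as a stand-in: normalizing a standard Gaussian gives the uniform distribution on the sphere, so by the data-processing inequality this quantity upper-bounds the KL between the projected normal and the uniform prior. Whether it matches the paper's regularizer is not confirmed by the text above.

```python
import numpy as np


def sample_projected_normal(mu, log_var, rng):
    """Draw z = w / ||w|| with w ~ N(mu, diag(exp(log_var)))."""
    w = mu + np.exp(0.5 * log_var) * rng.standard_normal(mu.shape)
    return w / np.linalg.norm(w, axis=1, keepdims=True)


def info_nce(z1, z2, temperature=0.5):
    """One-directional InfoNCE: row i of z1 should match row i of z2."""
    logits = z1 @ z2.T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))


def gaussian_kl_per_dim(mu, log_var):
    """KL(N(mu, diag(var)) || N(0, I)), batch-averaged and divided by dimension.

    Assumption: a surrogate upper bound, not necessarily the paper's KL term.
    """
    kl = 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var, axis=1)
    return float(np.mean(kl) / mu.shape[1])


def vcl_style_loss(mu1, log_var1, mu2, log_var2, rng, beta=0.1, temperature=0.5):
    """Symmetric InfoNCE on sampled embeddings plus a weighted KL regularizer."""
    z1 = sample_projected_normal(mu1, log_var1, rng)
    z2 = sample_projected_normal(mu2, log_var2, rng)
    nce = 0.5 * (info_nce(z1, z2, temperature) + info_nce(z2, z1, temperature))
    kl = 0.5 * (gaussian_kl_per_dim(mu1, log_var1) + gaussian_kl_per_dim(mu2, log_var2))
    return nce + beta * kl


rng = np.random.default_rng(0)
# Stand-ins for encoder outputs on two augmented views of 8 inputs.
mu_a = rng.normal(size=(8, 4))
mu_b = mu_a + 0.05 * rng.normal(size=(8, 4))
log_var = np.full((8, 4), -2.0)
print(vcl_style_loss(mu_a, log_var, mu_b, log_var, rng))
```

Sampling by normalizing a Gaussian draw keeps the embedding on the unit hypersphere, which is what lets the contrastive term use plain dot products as cosine similarities.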
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
The proposed method leads to an easy-to-implement objective. The empirical results show some improvement compared to conventional contrastive learning methodology, especially in mitigating collapse phenomena.
1. The motivation of this paper is not compelling. Why is it a limitation that conventional contrastive learning employs a deterministic map to represent each sample? What is the meaning of "uncertainty" in a representation when there is no "true" representative point in the latent space? How does the quantified uncertainty affect the downstream tasks? 2. The key approximation result (12) is not fully justified. It is based on the approximation (10), which in turn is based on Lemma 3.1. H
The paper provides a coherent theoretical reinterpretation of contrastive learning through a probabilistic lens. By formulating InfoNCE as a surrogate reconstruction term in an ELBO objective, it bridges a gap between variational inference and contrastive learning, offering a fresh theoretical grounding that could allow for further analytical work in this area. The introduction of a uniform spherical prior and projected normal posterior shows careful consideration of embedding geometry. This de
1. The paper contains several flaws in mathematical derivation. While the experimental framework remains sound, these issues require correction to ensure the theoretical contributions align with the stated goals of providing a rigorous connection between InfoNCE and ELBO. - **Incorrect sign propagation in the InfoNCE-ELBO connection** The paper's central theoretical contribution, minimizing InfoNCE asymptotically maximizes the ELBO reconstruction term, is incorrectly derived. In the proof of
1. The topic is highly relevant and timely, as **uncertainty modeling** and **probabilistic embeddings** are increasingly important in self-supervised and contrastive learning research.
2. The proposed approach is conceptually simple, easy to implement, and broadly compatible with existing contrastive frameworks without requiring architectural changes.
3. The paper is **clearly written** and well organized, with helpful figures, examples, and ablation studies that aid understanding.
4. The
## Major
1. **Issues in the theoretical derivations.** I believe several of the demonstrations contain conceptual or mathematical inaccuracies:
   - **Appendix B.1:** Since $z$ is a continuous variable, $H(r)$ represents a *differential entropy*, which can be negative. This invalidates the step in the proof that relies on $H(r) \ge 0$.
   - **Appendix C.1:** The assumption that $g$ is invertible seems unjustified. As stated in the paper, the representations are *compact versions* of t
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Face recognition and analysis · Adversarial Robustness in Machine Learning
Methods: Dense Connections · Random Gaussian Blur · Normalized Temperature-scaled Cross Entropy Loss · Feedforward Network · SimCLR · InfoNCE · Contrastive Learning
