Non-Contrastive Vision-Language Learning with Predictive Embedding Alignment

Lukas Kuhn; Giuseppe Serra; Florian Buettner

arXiv:2602.00653 · cs.CV · February 12, 2026

TL;DR

NOVA is a non-contrastive vision-language alignment method that trains an image encoder to predict the embeddings of a frozen text encoder. It needs no negative sampling, reduces the training objective to a single hyperparameter, and outperforms standard baselines on zero-shot chest X-ray classification.

Contribution

The paper presents NOVA, a non-contrastive framework for vision-language learning that simplifies training and reports more stable training and better zero-shot performance than contrastive methods.

Findings

01

Outperforms standard baselines on zero-shot chest X-ray classification

02

Exhibits more consistent training runs

03

Reduces training complexity with a single hyperparameter
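
Zero-shot classification with an aligned image and text embedding space is commonly done by embedding one text prompt per class and picking the class whose embedding is most similar to the image embedding. The sketch below illustrates that general recipe only; the function names, the toy three-dimensional vectors, and the use of cosine similarity are assumptions for illustration, not details taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def zero_shot_predict(image_embedding, label_embeddings):
    """Return the best-matching label and the per-label similarity scores.

    label_embeddings maps a class name to the text embedding of its prompt.
    In a real pipeline these would come from the frozen text encoder; here
    they are hand-written toy vectors (hypothetical values).
    """
    scores = {
        label: cosine(image_embedding, emb)
        for label, emb in label_embeddings.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


# Toy example: hypothetical embeddings, not real model outputs.
label_embeddings = {
    "pneumonia": [0.9, 0.1, 0.0],
    "no finding": [0.0, 0.2, 0.9],
}
image_embedding = [0.8, 0.3, 0.1]
label, scores = zero_shot_predict(image_embedding, label_embeddings)
```

No labeled images are needed at inference time in this setup: adding a class only requires embedding a new text prompt.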

Abstract

Vision-language models have transformed multimodal representation learning, yet dominant contrastive approaches like CLIP require large batch sizes, careful negative sampling, and extensive hyperparameter tuning. We introduce NOVA, a NOn-contrastive Vision-language Alignment framework based on joint embedding prediction with distributional regularization. NOVA aligns visual representations to a frozen, domain-specific text encoder by predicting text embeddings from augmented image views, while enforcing an isotropic Gaussian structure via Sketched Isotropic Gaussian Regularization (SIGReg). This eliminates the need for negative sampling, momentum encoders, or stop-gradients, reducing the training objective to a single hyperparameter. We evaluate NOVA on zero-shot chest X-ray classification using ClinicalBERT as the text encoder and Vision Transformers trained from scratch on MIMIC-CXR.…
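
The objective described in the abstract has two parts: a prediction term that pulls image-derived embeddings toward the frozen text embeddings, and a regularizer that pushes the embedding distribution toward an isotropic Gaussian. The following is a rough sketch of that structure under several assumptions that the abstract does not confirm: the prediction term is a mean squared error, the regularizer is applied to the predicted embeddings, the two terms are combined with a single weight `lam`, and Gaussianity is measured by comparing the empirical characteristic function of random one-dimensional projections with that of a standard normal. The exact SIGReg statistic is not given here, so `sketched_gaussian_reg` is a stand-in, and all function names are hypothetical.

```python
import cmath
import math
import random


def prediction_loss(pred, target):
    """Mean squared error between predicted and target embeddings (assumed form)."""
    n, d = len(pred), len(pred[0])
    total = sum(
        (p - t) ** 2
        for pv, tv in zip(pred, target)
        for p, t in zip(pv, tv)
    )
    return total / (n * d)


def random_unit_vector(dim, rng):
    """Sample a direction uniformly on the unit sphere."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def gaussianity_penalty_1d(samples, t_grid):
    """Weighted squared gap between the empirical characteristic function
    of the samples and exp(-t^2 / 2), the one of a standard normal."""
    n = len(samples)
    total = 0.0
    for t in t_grid:
        ecf = sum(cmath.exp(1j * t * x) for x in samples) / n
        gauss_cf = math.exp(-0.5 * t * t)
        total += abs(ecf - gauss_cf) ** 2 * gauss_cf  # Gaussian weighting
    return total / len(t_grid)


def sketched_gaussian_reg(embeddings, num_directions=16, seed=0):
    """Average the 1D penalty over random projections (the 'sketch')."""
    rng = random.Random(seed)
    dim = len(embeddings[0])
    t_grid = [-3.0 + 0.5 * i for i in range(13)]
    total = 0.0
    for _ in range(num_directions):
        u = random_unit_vector(dim, rng)
        proj = [sum(a * b for a, b in zip(e, u)) for e in embeddings]
        total += gaussianity_penalty_1d(proj, t_grid)
    return total / num_directions


def nova_style_loss(pred, target, lam=0.1):
    """Prediction term plus lam times the regularizer; lam is the single
    hyperparameter in this sketch (the value 0.1 is arbitrary)."""
    return prediction_loss(pred, target) + lam * sketched_gaussian_reg(pred)


# Toy check: collapsed embeddings are penalized more than Gaussian ones.
rng = random.Random(1)
gaussian_batch = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(200)]
collapsed_batch = [[0.0] * 4 for _ in range(200)]
reg_gaussian = sketched_gaussian_reg(gaussian_batch)
reg_collapsed = sketched_gaussian_reg(collapsed_batch)
```

In this toy setting the regularizer is what discourages the trivial solution of mapping every image to the same point, a role that negative samples play in contrastive training.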

Taxonomy

Topics: COVID-19 diagnosis using AI · Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning