SOTAlign: Semi-Supervised Alignment of Unimodal Vision and Language Models via Optimal Transport

Simon Roschmann; Paul Krzakala; Sonia Mazelet; Quentin Bouniot; Zeynep Akata

arXiv:2602.23353 · cs.LG · February 27, 2026

TL;DR

SOTAlign is a semi-supervised method that aligns pretrained unimodal vision and language models using a small number of image-text pairs together with large amounts of unpaired data, refining the joint embedding with an optimal-transport-based objective.

Contribution

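The paper introduces a two-stage framework for aligning frozen pretrained unimodal encoders under limited supervision: a linear teacher recovers a coarse shared geometry from the few available pairs, and an optimal-transport-based divergence then refines the alignment on unpaired samples.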

Findings

01. Outperforms supervised and semi-supervised baselines in alignment quality.

02. Effectively leverages unpaired images and text for robust joint embeddings.

03. Significantly reduces the need for large paired datasets.

Abstract

The Platonic Representation Hypothesis posits that neural networks trained on different modalities converge toward a shared statistical model of the world. Recent work exploits this convergence by aligning frozen pretrained vision and language models with lightweight alignment layers, but typically relies on contrastive losses and millions of paired samples. In this work, we ask whether meaningful alignment can be achieved with substantially less supervision. We introduce a semi-supervised setting in which pretrained unimodal encoders are aligned using a small number of image-text pairs together with large amounts of unpaired data. To address this challenge, we propose SOTAlign, a two-stage framework that first recovers a coarse shared geometry from limited paired data using a linear teacher, then refines the alignment on unpaired samples via an optimal-transport-based divergence that…
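
The two stages named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract is truncated before it specifies the divergence, so ridge regression is used here as a stand-in for the linear teacher and a plain entropic (Sinkhorn) transport plan as a stand-in for the optimal-transport-based divergence. All function names, dimensions, hyperparameters, and the synthetic data are hypothetical.

```python
import numpy as np

def fit_linear_teacher(X, Y, reg=1e-3):
    """Ridge-regression map W with X @ W ~ Y (assumed stand-in for the linear teacher)."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + reg * np.eye(d), X.T @ Y)

def cosine_cost(A, B):
    """Pairwise cosine distance between the rows of A and the rows of B."""
    A = A / np.linalg.norm(A, axis=1, keepdims=True)
    B = B / np.linalg.norm(B, axis=1, keepdims=True)
    return 1.0 - A @ B.T

def sinkhorn_plan(cost, eps=0.1, n_iters=200):
    """Entropic OT plan between uniform marginals (assumed stand-in for the OT divergence)."""
    n, m = cost.shape
    a = np.full(n, 1.0 / n)   # uniform mass on source samples
    b = np.full(m, 1.0 / m)   # uniform mass on target samples
    K = np.exp(-cost / eps)   # Gibbs kernel
    u, v = np.ones(n), np.ones(m)
    for _ in range(n_iters):
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]

# Synthetic stand-ins for frozen encoder outputs (hypothetical sizes).
rng = np.random.default_rng(0)
true_map = rng.normal(size=(16, 8))

# Stage 1: a few paired samples give a coarse linear map from image to text space.
X_pair = rng.normal(size=(32, 16))
Y_pair = X_pair @ true_map + 0.01 * rng.normal(size=(32, 8))
W = fit_linear_teacher(X_pair, Y_pair)

# Stage 2: unpaired samples are compared through a soft transport plan.
X_unpaired = rng.normal(size=(64, 16))
Y_unpaired = rng.normal(size=(48, 16)) @ true_map
cost = cosine_cost(X_unpaired @ W, Y_unpaired)
plan = sinkhorn_plan(cost)
ot_cost = float((plan * cost).sum())  # scalar that a refinement step could minimize
```

In this sketch the transport plan acts as a soft matching between unpaired images and texts, and `ot_cost` is the kind of scalar a trainable alignment layer could be optimized against. How SOTAlign actually defines and uses its divergence is not recoverable from the truncated abstract.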

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Topic Modeling