Escaping Plato's Cave: JAM for Aligning Independently Trained Vision and Language Models
Lauren Hyoseo Yoon, Yisong Yue, Been Kim

TL;DR
This paper introduces JAM, a method that explicitly aligns independently trained vision and language models by jointly training modality-specific autoencoders on top of the frozen models, so that the two modalities share a semantic space.
Contribution
The paper proposes the Joint Autoencoder Modulator (JAM), a novel approach for aligning independently trained unimodal models through joint autoencoder training with specialized objectives.
Findings
JAM reliably induces alignment across independent models.
The multimodal Spread Loss outperforms contrastive methods.
Alignment effectiveness varies with layer depth and model scale.
Abstract
Independently trained vision and language models inhabit disjoint representational spaces, shaped by their respective modalities, objectives, and architectures. The Platonic Representation Hypothesis (PRH) suggests these models may nonetheless converge toward a shared statistical model of reality. This raises a fundamental question: can we move beyond post-hoc detection of such alignment and explicitly optimize for it? We argue this challenge is most critical in fine-grained contextual distinctions, where multiple descriptions share global semantics but differ in subtle compositional details. We address this with the Joint Autoencoder Modulator (JAM), which aligns frozen unimodal models by jointly training modality-specific autoencoders with coordinated reconstruction and cross-modal alignment objectives. We systematically evaluate JAM across three design axes: (i) alignment objectives,…
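To make the abstract's description concrete, here is a minimal sketch of what a joint objective over two modality-specific autoencoders could look like. It is not the authors' implementation. The linear encoders and decoders, the `tanh` bottleneck, the feature sizes, the `align_weight` parameter, and all function names are assumptions made for illustration. The alignment term is a standard symmetric InfoNCE contrastive loss used as a stand-in; the paper's Spread Loss is not defined in this summary and is not reproduced here. The sketch only evaluates the loss on random stand-in features; actual training would require an autodiff framework.

```python
import numpy as np

def mse(a, b):
    """Mean squared reconstruction error."""
    return float(np.mean((a - b) ** 2))

def l2_normalize(x, eps=1e-8):
    """Scale each row to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)

def info_nce(z_img, z_txt, temperature=0.1):
    """Symmetric InfoNCE over paired rows (stand-in alignment loss, NOT the paper's Spread Loss)."""
    logits = l2_normalize(z_img) @ l2_normalize(z_txt).T / temperature

    def cross_entropy_on_diagonal(l):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -float(np.mean(np.diag(log_probs)))

    # Average the image-to-text and text-to-image directions.
    return 0.5 * (cross_entropy_on_diagonal(logits) + cross_entropy_on_diagonal(logits.T))

def joint_loss(feat_img, feat_txt, params, align_weight=1.0):
    """Reconstruction loss for each autoencoder plus a cross-modal alignment term."""
    # Hypothetical one-layer encoders map frozen features into a shared-size bottleneck.
    z_img = np.tanh(feat_img @ params["enc_img"])
    z_txt = np.tanh(feat_txt @ params["enc_txt"])
    # Hypothetical linear decoders reconstruct the original frozen features.
    recon = mse(z_img @ params["dec_img"], feat_img) + mse(z_txt @ params["dec_txt"], feat_txt)
    align = info_nce(z_img, z_txt)
    return recon + align_weight * align, recon, align

rng = np.random.default_rng(0)
feat_img = rng.normal(size=(8, 32))  # stand-in for frozen vision-model features
feat_txt = rng.normal(size=(8, 24))  # stand-in for frozen language-model features
params = {
    "enc_img": rng.normal(scale=0.1, size=(32, 16)),
    "dec_img": rng.normal(scale=0.1, size=(16, 32)),
    "enc_txt": rng.normal(scale=0.1, size=(24, 16)),
    "dec_txt": rng.normal(scale=0.1, size=(16, 24)),
}
total, recon, align = joint_loss(feat_img, feat_txt, params)
print(f"total={total:.4f} recon={recon:.4f} align={align:.4f}")
```

In this sketch the frozen unimodal models are represented only by their output features, which matches the abstract's statement that the base models stay frozen while the autoencoders are trained. Swapping `info_nce` for a different alignment objective would leave the rest of the structure unchanged.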
Taxonomy
Topics: Multimodal Machine Learning Applications · Language and cultural evolution · Generative Adversarial Networks and Image Synthesis
