Text-Conditional JEPA for Learning Semantically Rich Visual Representations
Chen Huang, Xianhang Li, Vimal Thilak, Etai Littwin, Josh Susskind

TL;DR
This paper introduces Text-Conditional JEPA (TC-JEPA), a self-supervised learning method that conditions masked feature prediction on image captions to learn more semantic visual representations, and that outperforms contrastive methods on fine-grained visual tasks.
Contribution
It proposes a new vision-language pretraining paradigm based on feature prediction conditioned on text, enhancing semantic learning and downstream task performance.
Findings
TC-JEPA improves downstream task performance.
TC-JEPA improves training stability and shows promising scaling properties.
TC-JEPA outperforms contrastive methods on fine-grained visual tasks.
Abstract
Image-based Joint-Embedding Predictive Architecture (I-JEPA) offers a promising approach to visual self-supervised learning through masked feature prediction. However, given the inherent visual uncertainty at masked positions, feature prediction remains challenging and may fail to learn semantic representations. In this work, we propose Text-Conditional JEPA (TC-JEPA), which uses image captions to reduce the prediction uncertainty. Specifically, we modulate the predicted patch features using a fine-grained text conditioner that computes sparse cross-attention over input text tokens. With such conditioning, patch features become predictable as a function of text and are thus more semantically meaningful. We show that TC-JEPA improves downstream performance and training stability, with promising scaling properties. TC-JEPA also offers a new vision-language pretraining paradigm based on feature…
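The text conditioner described in the abstract can be illustrated with a minimal sketch. The abstract does not specify how sparsity is obtained or how the patch features are modulated, so everything below is an assumption made for illustration: sparsity is implemented as top-k selection over text tokens, modulation is a simple additive residual, and the names `sparse_text_conditioner` and `top_k` are hypothetical. It uses plain Python lists and no learned projections, so it is not the authors' implementation.

```python
import math
import random


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def sparse_text_conditioner(patch_feats, text_tokens, top_k=2):
    """Modulate predicted patch features with sparse cross-attention over text.

    ASSUMPTIONS (not from the paper): sparsity = keep only the top_k
    highest-scoring text tokens per patch; modulation = add the attended
    text context to the patch feature.

    patch_feats: list of N vectors of length d (queries)
    text_tokens: list of T vectors of length d (keys and values)
    Returns (conditioned_feats, attention_weights).
    """
    d = len(text_tokens[0])
    k = min(top_k, len(text_tokens))
    conditioned, weights = [], []
    for q in patch_feats:
        # Scaled dot-product score between this patch and every text token.
        scores = [dot(q, t) / math.sqrt(d) for t in text_tokens]
        # Keep the indices of the k best-matching text tokens.
        keep = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:k]
        # Numerically stable softmax restricted to the kept tokens.
        m = max(scores[j] for j in keep)
        exps = {j: math.exp(scores[j] - m) for j in keep}
        z = sum(exps.values())
        w = [exps[j] / z if j in exps else 0.0 for j in range(len(scores))]
        # Attended text context, then additive modulation of the patch feature.
        ctx = [sum(w[j] * text_tokens[j][c] for j in keep) for c in range(d)]
        conditioned.append([qc + cc for qc, cc in zip(q, ctx)])
        weights.append(w)
    return conditioned, weights


# Toy usage with random vectors standing in for real features.
random.seed(0)
patches = [[random.gauss(0, 1) for _ in range(8)] for _ in range(4)]
tokens = [[random.gauss(0, 1) for _ in range(8)] for _ in range(6)]
conditioned, attn = sparse_text_conditioner(patches, tokens, top_k=2)
```

Each patch attends to at most `top_k` caption tokens, so the conditioning signal for a patch comes from the few words most related to it rather than from the whole caption.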
