Connecting Joint-Embedding Predictive Architecture with Contrastive Self-supervised Learning
Shentong Mo, Shengbang Tong

TL;DR
This paper introduces C-JEPA, a novel framework combining JEPA with VICReg to address collapse issues and improve visual representation learning, showing enhanced stability and performance on ImageNet-1K.
Contribution
The paper proposes C-JEPA, integrating contrastive regularization with JEPA to prevent collapse and improve learning stability, advancing unsupervised visual representation methods.
Findings
C-JEPA outperforms previous JEPA variants in stability and quality.
When pre-trained on ImageNet-1K, C-JEPA converges faster and reaches better accuracy.
C-JEPA effectively prevents collapse and learns better feature representations.
Abstract
In recent advancements in unsupervised visual representation learning, the Joint-Embedding Predictive Architecture (JEPA) has emerged as a significant method for extracting visual features from unlabeled imagery through an innovative masking strategy. Despite its success, two primary limitations have been identified: the inefficacy of Exponential Moving Average (EMA) from I-JEPA in preventing entire collapse and the inadequacy of I-JEPA prediction in accurately learning the mean of patch representations. Addressing these challenges, this study introduces a novel framework, namely C-JEPA (Contrastive-JEPA), which integrates the Image-based Joint-Embedding Predictive Architecture with the Variance-Invariance-Covariance Regularization (VICReg) strategy. This integration is designed to effectively learn the variance/covariance for preventing entire collapse and ensuring invariance in the…
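The variance, invariance, and covariance terms named in the abstract can be illustrated with a minimal NumPy sketch. It follows the general VICReg formulation, not the C-JEPA authors' code. The function names, the weights `lam`, `mu`, `nu`, the target standard deviation `gamma`, and `eps` are assumptions chosen for illustration, and how C-JEPA combines these terms with the JEPA prediction loss is not shown here.

```python
import numpy as np

def variance_term(z, gamma=1.0, eps=1e-4):
    """Hinge penalty that grows when a dimension's std falls below gamma.

    z: array of shape (batch, dim). gamma and eps are assumed values.
    """
    std = np.sqrt(z.var(axis=0, ddof=1) + eps)
    return float(np.mean(np.maximum(0.0, gamma - std)))

def covariance_term(z):
    """Sum of squared off-diagonal covariance entries, scaled by 1/dim."""
    n, d = z.shape
    zc = z - z.mean(axis=0)
    cov = zc.T @ zc / (n - 1)
    off_diag = cov - np.diag(np.diag(cov))
    return float(np.sum(off_diag ** 2) / d)

def invariance_term(z_a, z_b):
    """Mean squared error between two sets of embeddings."""
    return float(np.mean((z_a - z_b) ** 2))

def vicreg_regularizer(z_a, z_b, lam=25.0, mu=25.0, nu=1.0):
    """Weighted sum of the three terms (weights are assumptions)."""
    inv = invariance_term(z_a, z_b)
    var = variance_term(z_a) + variance_term(z_b)
    cov = covariance_term(z_a) + covariance_term(z_b)
    return lam * inv + mu * var + nu * cov

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    spread = rng.normal(size=(256, 8)) * 2.0   # well-spread embeddings
    collapsed = np.ones((256, 8))              # every sample identical
    print("variance term, spread:   ", variance_term(spread))
    print("variance term, collapsed:", variance_term(collapsed))
```

In this sketch, fully collapsed embeddings (every sample mapped to the same vector) have zero invariance and covariance penalties, so only the variance term flags them. This is why a variance constraint is the part that guards against complete collapse.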
Taxonomy
Topics: Text and Document Classification Technologies
