Connecting Joint-Embedding Predictive Architecture with Contrastive Self-supervised Learning

Shentong Mo; Shengbang Tong

arXiv:2410.19560·cs.CV·October 28, 2024

Open Access

TL;DR

This paper introduces C-JEPA, a novel framework combining JEPA with VICReg to address collapse issues and improve visual representation learning, showing enhanced stability and performance on ImageNet-1K.

Contribution

The paper proposes C-JEPA, integrating contrastive regularization with JEPA to prevent collapse and improve learning stability, advancing unsupervised visual representation methods.

Findings

01. C-JEPA outperforms previous JEPA variants in stability and quality.

02. Pre-training on ImageNet-1K shows faster convergence and better accuracy.

03. C-JEPA effectively prevents collapse and learns better feature representations.

Abstract

In recent advancements in unsupervised visual representation learning, the Joint-Embedding Predictive Architecture (JEPA) has emerged as a significant method for extracting visual features from unlabeled imagery through an innovative masking strategy. Despite its success, two primary limitations have been identified: the inefficacy of Exponential Moving Average (EMA) from I-JEPA in preventing entire collapse and the inadequacy of I-JEPA prediction in accurately learning the mean of patch representations. Addressing these challenges, this study introduces a novel framework, namely C-JEPA (Contrastive-JEPA), which integrates the Image-based Joint-Embedding Predictive Architecture with the Variance-Invariance-Covariance Regularization (VICReg) strategy. This integration is designed to effectively learn the variance/covariance for preventing entire collapse and ensuring invariance in the…
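
As a rough illustration of the variance, invariance, and covariance terms the abstract refers to, the following is a minimal, dependency-free sketch of a VICReg-style loss on two batches of embeddings. It is not the authors' implementation: the function names, the loss weights (25, 25, 1), and the toy inputs are assumptions made for illustration, and how C-JEPA applies these terms to patch representations is not specified in this excerpt.

```python
import math


def column_means(z):
    """Per-dimension mean of a batch given as a list of equal-length rows."""
    n = len(z)
    return [sum(row[j] for row in z) / n for j in range(len(z[0]))]


def variance_term(z, gamma=1.0, eps=1e-4):
    """Hinge on per-dimension std: penalizes dimensions whose std is below gamma."""
    n, d = len(z), len(z[0])
    mu = column_means(z)
    total = 0.0
    for j in range(d):
        var = sum((row[j] - mu[j]) ** 2 for row in z) / (n - 1)
        total += max(0.0, gamma - math.sqrt(var + eps))
    return total / d


def covariance_term(z):
    """Sum of squared off-diagonal covariances, divided by the dimension."""
    n, d = len(z), len(z[0])
    mu = column_means(z)
    total = 0.0
    for i in range(d):
        for j in range(d):
            if i != j:
                cov = sum((r[i] - mu[i]) * (r[j] - mu[j]) for r in z) / (n - 1)
                total += cov ** 2
    return total / d


def invariance_term(za, zb):
    """Mean squared error between paired embeddings (over samples and dims)."""
    n, d = len(za), len(za[0])
    sq = sum((a - b) ** 2 for ra, rb in zip(za, zb) for a, b in zip(ra, rb))
    return sq / (n * d)


def vicreg_loss(za, zb, sim_w=25.0, var_w=25.0, cov_w=1.0):
    """Weighted sum of the three terms; the weights are assumed defaults."""
    inv = invariance_term(za, zb)
    var = variance_term(za) + variance_term(zb)
    cov = covariance_term(za) + covariance_term(zb)
    return sim_w * inv + var_w * var + cov_w * cov


# Toy batches (hypothetical): every sample identical vs. well spread out.
collapsed = [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]
spread = [[2.0, 0.0], [-2.0, 0.0], [0.0, 2.0], [0.0, -2.0]]
correlated = [[1.0, 1.0], [-1.0, -1.0], [2.0, 2.0], [-2.0, -2.0]]

print(vicreg_loss(collapsed, collapsed))  # large: variance hinge is active
print(vicreg_loss(spread, spread))        # zero: no term is violated
print(covariance_term(correlated))        # positive: dimensions are redundant
```

In this sketch the collapsed batch is penalized only through the variance term, since its two views match exactly and its covariance is zero. That is the sense in which a variance regularizer discourages collapse independently of the prediction objective.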

Taxonomy

Topics: Text and Document Classification Technologies