Video Representation Learning with Joint-Embedding Predictive Architectures

Katrina Drozdov; Ravid Shwartz-Ziv; Yann LeCun

arXiv:2412.10925·cs.CV·December 17, 2024

TL;DR

This paper introduces VJ-VCR, a self-supervised video representation learning method that combines a joint-embedding predictive architecture with variance-covariance regularization to capture high-level dynamics and uncertainty in videos.

Contribution

The paper proposes VJ-VCR, an architecture that uses variance and covariance regularization to prevent representation collapse, learns high-level information about videos, and can incorporate latent variables to represent uncertainty about the future.

Findings

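01. VJ-VCR representations outperform those of a generative baseline on downstream tasks that require understanding the dynamics of moving objects.

02. The learned representations contain abstract, high-level information about video dynamics.

03. Incorporating latent variables captures uncertainty about the future in non-deterministic settings.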

Abstract

Video representation learning is an increasingly important topic in machine learning research. We present Video JEPA with Variance-Covariance Regularization (VJ-VCR): a joint-embedding predictive architecture for self-supervised video representation learning that employs variance and covariance regularization to avoid representation collapse. We show that hidden representations from our VJ-VCR contain abstract, high-level information about the input data. Specifically, they outperform representations obtained from a generative baseline on downstream tasks that require understanding of the underlying dynamics of moving objects in the videos. Additionally, we explore different ways to incorporate latent variables into the VJ-VCR framework that capture information about uncertainty in the future in non-deterministic settings.
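The variance and covariance regularization mentioned in the abstract can be illustrated with a minimal sketch. It assumes a VICReg-style formulation: a hinge on the per-dimension standard deviation plus a penalty on off-diagonal covariance entries. The function name `variance_covariance_penalty`, the threshold `gamma`, and the constant `eps` are hypothetical choices for illustration; the paper's exact terms and weights may differ.

```python
import numpy as np

def variance_covariance_penalty(z, gamma=1.0, eps=1e-4):
    """Return (variance_term, covariance_term) for a batch of representations.

    z: array of shape (batch, dim). gamma and eps are assumed values,
    not taken from the paper.
    """
    n, d = z.shape
    z = z - z.mean(axis=0, keepdims=True)          # center each dimension
    std = np.sqrt(z.var(axis=0, ddof=1) + eps)     # per-dimension std
    # Hinge: penalize dimensions whose std falls below gamma (guards against collapse).
    var_term = np.mean(np.maximum(0.0, gamma - std))
    cov = (z.T @ z) / (n - 1)                      # (dim, dim) covariance matrix
    off_diag = cov - np.diag(np.diag(cov))
    # Penalize correlated dimensions so they carry non-redundant information.
    cov_term = np.sum(off_diag ** 2) / d
    return var_term, cov_term

rng = np.random.default_rng(0)
collapsed = np.ones((4096, 8))                     # every sample maps to the same vector
spread = rng.standard_normal((4096, 8))            # roughly unit variance, uncorrelated
x = rng.standard_normal((4096, 1))
redundant = np.concatenate([x, x], axis=1)         # two identical dimensions

collapsed_var, collapsed_cov = variance_covariance_penalty(collapsed)
spread_var, spread_cov = variance_covariance_penalty(spread)
redundant_var, redundant_cov = variance_covariance_penalty(redundant)
```

In this sketch a collapsed batch incurs a large variance term, while a batch with duplicated dimensions incurs a large covariance term. In a joint-embedding setup, terms like these would typically be added to the prediction loss in representation space.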

Taxonomy

Topics: Human Pose and Action Recognition · Video Analysis and Summarization · Video Surveillance and Tracking Methods