Video Representation Learning with Joint-Embedding Predictive Architectures
Katrina Drozdov, Ravid Shwartz-Ziv, Yann LeCun

TL;DR
This paper introduces VJ-VCR, a self-supervised video representation learning method that combines a joint-embedding predictive architecture with variance-covariance regularization to capture high-level dynamics and uncertainty in videos.
Contribution
The paper proposes VJ-VCR, a joint-embedding predictive architecture that uses variance and covariance regularization to avoid representation collapse and that captures high-level information about videos, including uncertainty about the future.
Findings
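VJ-VCR representations outperform those of a generative baseline on downstream tasks that require understanding the dynamics of moving objects.
The hidden representations contain abstract, high-level information about video dynamics.
Incorporating latent variables lets the model capture uncertainty about the future in non-deterministic settings.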
Abstract
Video representation learning is an increasingly important topic in machine learning research. We present Video JEPA with Variance-Covariance Regularization (VJ-VCR): a joint-embedding predictive architecture for self-supervised video representation learning that employs variance and covariance regularization to avoid representation collapse. We show that hidden representations from our VJ-VCR contain abstract, high-level information about the input data. Specifically, they outperform representations obtained from a generative baseline on downstream tasks that require understanding of the underlying dynamics of moving objects in the videos. Additionally, we explore different ways to incorporate latent variables into the VJ-VCR framework that capture information about uncertainty in the future in non-deterministic settings.
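The abstract does not spell out the regularizer, so the following is only a minimal sketch of what variance and covariance regularization typically looks like, assuming the VICReg-style formulation: a hinge on each embedding dimension's standard deviation, plus a penalty on the off-diagonal entries of the covariance matrix. The function names, the target `gamma`, and the toy batches are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def variance_loss(z, gamma=1.0, eps=1e-4):
    """Hinge penalty that is positive when a dimension's std falls below gamma.

    z has shape (batch, dim). gamma and eps are assumed values.
    """
    std = np.sqrt(z.var(axis=0, ddof=1) + eps)
    return np.mean(np.maximum(0.0, gamma - std))

def covariance_loss(z):
    """Sum of squared off-diagonal covariance entries, scaled by dim."""
    n, d = z.shape
    zc = z - z.mean(axis=0)                  # center each dimension
    cov = zc.T @ zc / (n - 1)                # (dim, dim) covariance matrix
    off_diag = cov - np.diag(np.diag(cov))   # zero out the diagonal
    return np.sum(off_diag ** 2) / d

rng = np.random.default_rng(0)

# Collapsed embeddings: every sample maps to the same vector.
collapsed = np.ones((256, 8))
# Redundant embeddings: all dimensions carry the same signal.
redundant = np.repeat(rng.standard_normal((256, 1)), 8, axis=1)
# Spread embeddings: roughly unit variance, roughly decorrelated.
spread = rng.standard_normal((256, 8))

for name, z in [("collapsed", collapsed), ("redundant", redundant), ("spread", spread)]:
    print(name, float(variance_loss(z)), float(covariance_loss(z)))
```

In this toy setup the variance term is largest for the collapsed batch and the covariance term is largest for the redundant batch, which is the sense in which the two penalties together discourage collapse. How VJ-VCR weights these terms and where it applies them is not stated in the text above.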
Taxonomy
Topics: Human Pose and Action Recognition · Video Analysis and Summarization · Video Surveillance and Tracking Methods
