V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning

Mido Assran; Adrien Bardes; David Fan; Quentin Garrido; Russell Howes; Mojtaba Komeili; Matthew Muckley; Ammar Rizvi; Claire Roberts; Koustuv Sinha; Artem Zholus; Sergio Arnaud; Abha Gejji; Ada Martin; Francois Robert Hogan; Daniel Dugas; Piotr Bojanowski; Vasil Khalidov; Patrick Labatut; Francisco Massa; Marc Szafraniec; Kapil Krishnakumar; Yong Li; Xiaodong Ma; Sarath Chandar; Franziska Meier; Yann LeCun; Michael Rabbat; Nicolas Ballas

arXiv:2506.09985 · cs.AI · June 12, 2025

TL;DR

This paper introduces V-JEPA 2, a self-supervised video model trained on internet-scale data that achieves state-of-the-art performance in understanding, predicting, and planning in physical environments, including robotic tasks, without task-specific training.

Contribution

The paper presents V-JEPA 2, a novel self-supervised video model that effectively combines large-scale internet video data with minimal robot interaction data for diverse understanding and planning tasks.

Findings

01 Achieves 77.3 top-1 accuracy on motion understanding (Something-Something v2)

02 Sets a new state of the art in human action anticipation with 39.7 recall-at-5 (Epic-Kitchens-100)

03 Enables zero-shot robotic object manipulation without task-specific data

Abstract

A major challenge for modern AI is to learn to understand the world and learn to act largely by observation. This paper explores a self-supervised approach that combines internet-scale video data with a small amount of interaction data (robot trajectories), to develop models capable of understanding, predicting, and planning in the physical world. We first pre-train an action-free joint-embedding-predictive architecture, V-JEPA 2, on a video and image dataset comprising over 1 million hours of internet video. V-JEPA 2 achieves strong performance on motion understanding (77.3 top-1 accuracy on Something-Something v2) and state-of-the-art performance on human action anticipation (39.7 recall-at-5 on Epic-Kitchens-100) surpassing previous task-specific models. Additionally, after aligning V-JEPA 2 with a large language model, we demonstrate state-of-the-art performance on multiple video…
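The two metrics quoted in the abstract, top-1 accuracy and recall-at-5, can be illustrated with a minimal sketch. This is not the paper's evaluation code: the function names, the toy scores, and the choice of percentage units are assumptions made for illustration. Benchmarks also differ on whether recall is averaged over samples or over classes, and this page does not say which the reported 39.7 uses, so both variants are shown.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def top_k_hits(scores: Sequence[Sequence[float]], labels: Sequence[int], k: int) -> List[bool]:
    """For each sample, report whether the true label is among the k highest-scoring classes."""
    hits = []
    for row, label in zip(scores, labels):
        # Rank class indices from highest to lowest score.
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        hits.append(label in ranked[:k])
    return hits


def top1_accuracy(scores: Sequence[Sequence[float]], labels: Sequence[int]) -> float:
    """Percentage of samples whose highest-scoring class is the true label."""
    hits = top_k_hits(scores, labels, 1)
    return 100.0 * sum(hits) / len(hits)


def recall_at_k(scores: Sequence[Sequence[float]], labels: Sequence[int], k: int = 5) -> float:
    """Sample-averaged recall-at-k: percentage of samples with the true label in the top k."""
    hits = top_k_hits(scores, labels, k)
    return 100.0 * sum(hits) / len(hits)


def class_mean_recall_at_k(scores: Sequence[Sequence[float]], labels: Sequence[int], k: int = 5) -> float:
    """Class-averaged recall-at-k: compute recall per class, then average over classes present."""
    per_class: Dict[int, List[bool]] = defaultdict(list)
    for hit, label in zip(top_k_hits(scores, labels, k), labels):
        per_class[label].append(hit)
    recalls = [sum(h) / len(h) for h in per_class.values()]
    return 100.0 * sum(recalls) / len(recalls)


# Toy example (hypothetical data): 3 samples, 3 classes.
toy_scores = [[0.1, 0.7, 0.2], [0.5, 0.3, 0.2], [0.2, 0.3, 0.5]]
toy_labels = [1, 1, 2]
print(top1_accuracy(toy_scores, toy_labels))
print(recall_at_k(toy_scores, toy_labels, k=2))
print(class_mean_recall_at_k(toy_scores, toy_labels, k=1))
```

The class-averaged variant weights rare classes as heavily as frequent ones, which matters on long-tailed action datasets; the sample-averaged variant is dominated by the most common classes.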

Code & Models

Repositories

facebookresearch/vjepa2 (PyTorch, official)

Models

🤗 irfanalee/worldguard (model)

Datasets

ckadirt/vjxla (dataset, 453 downloads)

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Social Robot Interaction and HRI