TL;DR
EB-JEPA is an open-source library that enables efficient, self-supervised learning of representations for images, videos, and action-conditioned world models, emphasizing modularity and accessibility.
Contribution
The paper introduces EB-JEPA, a modular library that adapts joint-embedding predictive architectures for diverse modalities, including video and action-conditioned world modeling.
Findings
Probes on the learned representations reach 91% accuracy on CIFAR-10.
Multi-step prediction on Moving MNIST demonstrates temporal modeling.
Action-conditioned world models reach 97% planning success on Two Rooms.
Abstract
We present EB-JEPA, an open-source library for learning representations and world models using Joint-Embedding Predictive Architectures (JEPAs). JEPAs learn to predict in representation space rather than pixel space, avoiding the pitfalls of generative modeling while capturing semantically meaningful features suitable for downstream tasks. Our library provides modular, self-contained implementations that illustrate how representation learning techniques developed for image-level self-supervised learning can transfer to video, where temporal dynamics add complexity, and ultimately to action-conditioned world models, where the model must additionally learn to predict the effects of control inputs. Each example is designed for single-GPU training within a few hours, making energy-based self-supervised learning accessible for research and education. We provide ablations of JEA components on…
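The idea of predicting in representation space rather than pixel space can be illustrated with a small NumPy sketch. This is not the EB-JEPA API: the function names (`encode`, `predict`, `jepa_energy`, `variance_penalty`), the toy linear encoder and predictor, and all dimensions are hypothetical. The variance hinge used here to discourage collapse is an assumption borrowed from VICReg-style regularization, not something the abstract specifies.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(x, W):
    """Toy encoder (linear map + tanh); stands in for a deep network."""
    return np.tanh(x @ W)

def predict(z, a, P, B):
    """Toy action-conditioned predictor: next representation from (z, a)."""
    return z @ P + a @ B

def jepa_energy(x_t, a_t, x_next, W, P, B):
    """Squared prediction error measured between representations, not pixels."""
    z_t = encode(x_t, W)
    z_next = encode(x_next, W)       # target is an embedding; nothing is reconstructed
    z_hat = predict(z_t, a_t, P, B)
    return float(np.mean(np.sum((z_hat - z_next) ** 2, axis=1)))

def variance_penalty(z, target_std=1.0, eps=1e-4):
    """Hinge on per-dimension std (assumed regularizer): penalizes constant embeddings."""
    std = np.sqrt(z.var(axis=0) + eps)
    return float(np.mean(np.maximum(0.0, target_std - std)))

# Hypothetical sizes, chosen only for illustration.
batch, obs_dim, act_dim, rep_dim = 32, 64, 2, 8
x_t = rng.normal(size=(batch, obs_dim))                  # current observations
a_t = rng.normal(size=(batch, act_dim))                  # control inputs
x_next = x_t + 0.1 * rng.normal(size=(batch, obs_dim))   # next observations

W = rng.normal(scale=0.1, size=(obs_dim, rep_dim))       # encoder weights
P = np.eye(rep_dim)                                      # predictor weights on z
B = np.zeros((act_dim, rep_dim))                         # predictor weights on a

energy = jepa_energy(x_t, a_t, x_next, W, P, B)
penalty = variance_penalty(encode(x_t, W))
loss = energy + penalty

# A collapsed encoder maps every input to the same embedding. Its prediction
# energy is trivially zero, so the energy term alone cannot rule it out.
W_collapsed = np.zeros_like(W)
collapsed_energy = jepa_energy(x_t, a_t, x_next, W_collapsed, P, B)
collapsed_penalty = variance_penalty(encode(x_t, W_collapsed))

print("energy:", energy, "penalty:", penalty, "loss:", loss)
print("collapsed energy:", collapsed_energy, "collapsed penalty:", collapsed_penalty)
```

The collapsed case shows why some anti-collapse term is needed in this kind of objective: minimizing prediction error in representation space alone admits a constant encoder as a solution, while the variance term assigns that solution a high cost.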
