Factorized Latent Dynamics for Video JEPA: An Empirical Study of Auxiliary Objectives
Santosh Premi

TL;DR
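This paper reports a small-scale empirical study of 18 auxiliary objective variants for Video-JEPA, finds that many of them trade one downstream capability against another, and studies FWM-HW-LD, a factorized latent dynamics approach that improves ImageNet-100 by +5.92 percentage points while staying within 0.30 points on Diving-48.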
Contribution
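It provides a small-scale empirical analysis of 18 auxiliary objective variants for Video-JEPA across single-dataset and mixed-dataset pretraining, and introduces FWM-HW-LD (Factorized World-Model with Hard-Region-Weighted Latent Dynamics), a factorized latent dynamics method evaluated on frozen representations.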
Findings
Many auxiliary objectives cause capacity trade-offs between downstream tasks.
FWM-HW-LD improves ImageNet-100 by +5.92 percentage points.
FWM-HW-LD maintains performance on Diving-48 within 0.30 points.
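The notion of a capacity trade-off in the findings above can be made concrete with a small sketch. It assumes each variant is summarized by its per-benchmark accuracy change relative to a baseline, in percentage points; the function name `classify_variant`, the `tolerance` noise band, the variant names, and all numeric values are hypothetical placeholders for illustration, not the paper's procedure or results.

```python
from typing import Dict


def classify_variant(deltas: Dict[str, float], tolerance: float = 0.5) -> str:
    """Label a variant from its per-benchmark accuracy deltas versus a baseline.

    deltas: benchmark name -> (variant accuracy - baseline accuracy), in points.
    tolerance: assumed noise band; changes inside it count as unchanged.
    """
    gains = [name for name, d in deltas.items() if d > tolerance]
    losses = [name for name, d in deltas.items() if d < -tolerance]
    if gains and losses:
        return "trade-off"      # better on one capability, worse on another
    if gains:
        return "improvement"    # gains with no loss beyond the noise band
    if losses:
        return "degradation"
    return "neutral"


# Placeholder deltas for illustration only; these are NOT results from the paper.
example_deltas = {
    "variant_a": {"Diving-48": 2.0, "SSv2": 0.1, "ImageNet-100": -3.0},
    "variant_b": {"Diving-48": -0.2, "SSv2": 0.3, "ImageNet-100": 4.0},
    "variant_c": {"Diving-48": 0.1, "SSv2": -0.2, "ImageNet-100": 0.0},
}

labels = {name: classify_variant(d) for name, d in example_deltas.items()}
for name, label in labels.items():
    print(name, label)
```

Under this labeling, a variant counts as a trade-off only when it gains on at least one benchmark and loses on another by more than the assumed tolerance, so the choice of tolerance directly controls how many variants are flagged.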
Abstract
Joint-Embedding Predictive Architectures (JEPA) are a promising framework for self-supervised video representation learning, yet the behavior of auxiliary objectives in small-scale Video-JEPA training is not well characterized. We report a small-scale empirical study of 18 auxiliary objective variants for Video-JEPA across two pretraining regimes: single-dataset (UCF-101) and mixed-dataset (UCF-101 + Something-Something V2 + ImageNet-100). We evaluate frozen representations on three complementary benchmarks: Diving-48 (fine-grained motion), Something-Something V2 (temporal reasoning), and ImageNet-100 (appearance). Our experiments suggest that many auxiliary objectives exhibit capacity trade-offs: gains on one downstream capability often coincide with degradation on another. We then study FWM-HW-LD (Factorized World-Model with Hard-Region-Weighted Latent Dynamics), a training-time…
