Factorized Latent Dynamics for Video JEPA: An Empirical Study of Auxiliary Objectives

Santosh Premi

arXiv:2605.17165 · cs.CV · May 19, 2026

TL;DR

This paper empirically studies auxiliary objectives in Video-JEPA, finds capacity trade-offs between downstream tasks, and proposes FWM-HW-LD, a factorized latent dynamics approach that improves ImageNet-100 by +5.92 percentage points while staying within 0.30 points on Diving-48.

Contribution

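It provides a small-scale empirical analysis of 18 auxiliary objective variants for Video-JEPA across two pretraining regimes, and introduces FWM-HW-LD, a factorized latent dynamics method that improves appearance performance (ImageNet-100) while roughly preserving fine-grained motion performance (Diving-48).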

Findings

01

Many auxiliary objectives cause capacity trade-offs between downstream tasks.

02

FWM-HW-LD improves ImageNet-100 by +5.92 percentage points.

03

FWM-HW-LD maintains performance on Diving-48 within 0.30 points.
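
The trade-off pattern in the findings above can be made concrete with a small sketch. It assumes each variant is summarized by its per-benchmark change against a baseline, in percentage points, and labels it a trade-off when at least one benchmark gains and another loses by more than a tolerance. The function name `classify_deltas`, the 0.30-point tolerance, and the negative sign on the Diving-48 change are illustrative assumptions, not the paper's protocol.

```python
def classify_deltas(deltas, tolerance=0.30):
    """Label a variant by its per-benchmark change versus a baseline.

    deltas: benchmark name -> change in percentage points (variant - baseline).
    tolerance: hypothetical threshold; changes within +/- tolerance count as flat.
    Returns (label, gained_benchmarks, lost_benchmarks).
    """
    gains = [name for name, d in deltas.items() if d > tolerance]
    losses = [name for name, d in deltas.items() if d < -tolerance]
    if gains and losses:
        label = "trade-off"      # better on one capability, worse on another
    elif gains:
        label = "improvement"    # gains with no loss beyond the tolerance
    elif losses:
        label = "degradation"
    else:
        label = "neutral"
    return label, gains, losses


# Made-up deltas for a hypothetical variant (not results from the paper).
print(classify_deltas({"Diving-48": 1.5, "ImageNet-100": -2.0}))

# The two deltas reported for FWM-HW-LD, assuming the Diving-48 change
# is a 0.30-point drop (the source only says "within 0.30 points").
print(classify_deltas({"ImageNet-100": 5.92, "Diving-48": -0.30}))
```

Under these assumptions the first call is labeled a trade-off and the second an improvement, since a 0.30-point drop does not exceed the tolerance. A stricter tolerance would change the second label, so the threshold should be set against run-to-run variance rather than taken from this sketch.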

Abstract

Joint-Embedding Predictive Architectures (JEPA) are a promising framework for self-supervised video representation learning, yet the behavior of auxiliary objectives in small-scale Video-JEPA training is not well characterized. We report a small-scale empirical study of 18 auxiliary objective variants for Video-JEPA across two pretraining regimes: single-dataset (UCF-101) and mixed-dataset (UCF-101 + Something-Something V2 + ImageNet-100). We evaluate frozen representations on three complementary benchmarks: Diving-48 (fine-grained motion), Something-Something V2 (temporal reasoning), and ImageNet-100 (appearance). Our experiments suggest that many auxiliary objectives exhibit capacity trade-offs: gains on one downstream capability often coincide with degradation on another. We then study FWM-HW-LD (Factorized World-Model with Hard-Region-Weighted Latent Dynamics), a training-time…
