Entity-Centric World Models: Interaction-Aware Masking for Causal Video Prediction
Santosh Kumar Paidi

TL;DR
This paper introduces IA-JEPA, a self-supervised, interaction-aware masking strategy for video prediction that emphasizes physical interactions, significantly improving causal reasoning and latent space quality in world models.
Contribution
It proposes a novel motion-centric masking approach that enhances causal understanding in world models, outperforming standard methods on multiple benchmarks.
Findings
Achieves 14.26% accuracy on CLEVRER causal tasks, compared with 3.22% for baselines.
Induces higher-entropy, more discriminative latent spaces (+10%).
Generalizes to real-world human actions and physical puzzles.
Abstract
Learning predictive world models from unlabelled video is a foundational challenge in artificial intelligence. While Joint Embedding Predictive Architectures (JEPA) have set new benchmarks in semantic classification, they often remain physics-blind, failing to capture the causal dynamics necessary for downstream reasoning. We hypothesize that this stems from standard patch-based masking strategies, which prioritize visual texture over rare but informative kinematic events. We propose Interaction-Aware JEPA (IA-JEPA), which utilizes a self-supervised motion-centric masking strategy to prioritize physical interactions. By specifically targeting entities engaged in collisions or momentum transfers, we force the architecture to reconstruct latent trajectories rather than static background features. Evaluated on the CLEVRER benchmark, IA-JEPA achieves 14.26% accuracy on causal reasoning…
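To make the idea of motion-centric masking concrete, the following is a minimal sketch, not the paper's method. It assumes a simple frame-differencing heuristic as a stand-in for interaction detection: patches with the highest temporal change are masked first, so the predictor must recover moving entities rather than static background. The function name `motion_centric_mask`, the `patch` and `mask_ratio` parameters, and the use of frame differencing are all illustrative assumptions; the excerpt above does not specify how IA-JEPA identifies collisions or momentum transfers.

```python
import numpy as np


def motion_centric_mask(video, patch=16, mask_ratio=0.5):
    """Pick patches to mask by motion energy (hypothetical helper).

    video: float array of shape (T, H, W), grayscale frames.
    Returns a boolean array of shape (H // patch, W // patch),
    where True marks a patch selected for masking.
    """
    T, H, W = video.shape
    gh, gw = H // patch, W // patch

    # Assumption: absolute frame differences are a crude proxy for motion.
    diff = np.abs(np.diff(video, axis=0)).sum(axis=0)  # (H, W)
    diff = diff[: gh * patch, : gw * patch]             # drop ragged border

    # Sum the motion signal inside each non-overlapping patch.
    energy = diff.reshape(gh, patch, gw, patch).sum(axis=(1, 3))  # (gh, gw)

    # Mask the patches with the highest motion energy.
    n_mask = int(round(mask_ratio * gh * gw))
    order = np.argsort(energy.ravel(), kind="stable")[::-1]
    mask = np.zeros(gh * gw, dtype=bool)
    mask[order[:n_mask]] = True
    return mask.reshape(gh, gw)


if __name__ == "__main__":
    # Toy clip: a small square drifts downward inside the top-left patch.
    clip = np.zeros((4, 32, 32))
    for t in range(4):
        clip[t, 2 + t : 6 + t, 2:6] = 1.0
    print(motion_centric_mask(clip, patch=16, mask_ratio=0.25))
```

In the toy clip, only the top-left patch contains movement, so it is the one selected. A uniform random masking scheme would instead pick patches without regard to where the motion occurs, which is the contrast the abstract draws with standard patch-based masking.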
