ThinkJEPA: Empowering Latent World Models with Large Vision-Language Reasoning Model

Haichao Zhang; Yijiang Li; Shwai He; Tushar Nagarajan; Mingfei Chen; Jianglin Lu; Ang Li; Yun Fu

arXiv:2603.22281 · cs.CV · March 24, 2026

TL;DR

This paper introduces a VLM-guided latent world modeling framework that combines dense motion prediction with semantic reasoning, improving long-horizon forecasting in video-based tasks.

Contribution

The paper proposes a dual-temporal pathway model that integrates dense JEPA dynamics with a large vision-language model for semantic guidance, with the aim of improving long-horizon prediction accuracy.

Findings

01. Outperforms baseline models in hand-manipulation trajectory prediction

02. Achieves more robust long-horizon rollout behavior

03. Effectively integrates semantic guidance into dense motion modeling

Abstract

Recent progress in latent world models (e.g., V-JEPA2) has shown promising capability in forecasting future world states from video observations. Nevertheless, dense prediction from a short observation window limits temporal context and can bias predictors toward local, low-level extrapolation, making it difficult to capture long-horizon semantics and reducing downstream utility. Vision-language models (VLMs), in contrast, provide strong semantic grounding and general knowledge by reasoning over uniformly sampled frames, but they are not ideal as standalone dense predictors due to compute-driven sparse sampling, a language-output bottleneck that compresses fine-grained interaction states into text-oriented representations, and a data-regime mismatch when adapting to small action-conditioned datasets. We propose a VLM-guided JEPA-style latent world modeling framework that combines…
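The abstract contrasts two ways of reading a video: a dense, short observation window (the JEPA-style predictor) and uniformly sampled frames across the whole clip (the VLM). The sketch below only illustrates that difference in frame indexing. It is not the paper's implementation; the function names, the 64-frame clip length, and the 8-frame budget are hypothetical values chosen for illustration.

```python
# Illustrative only: function names and all numeric values are assumptions,
# not taken from the paper.

def dense_window_indices(num_frames: int, window: int) -> list[int]:
    """Indices of the last `window` consecutive frames (dense, short context)."""
    if num_frames <= 0 or window <= 0:
        raise ValueError("num_frames and window must be positive")
    start = max(0, num_frames - window)
    return list(range(start, num_frames))


def uniform_sparse_indices(num_frames: int, num_samples: int) -> list[int]:
    """`num_samples` indices spread evenly over the clip (sparse, long context)."""
    if num_frames <= 0 or num_samples <= 0:
        raise ValueError("num_frames and num_samples must be positive")
    num_samples = min(num_samples, num_frames)  # cannot sample more frames than exist
    if num_samples == 1:
        return [num_frames // 2]
    step = (num_frames - 1) / (num_samples - 1)
    return [round(i * step) for i in range(num_samples)]


if __name__ == "__main__":
    clip_length = 64   # hypothetical clip length in frames
    frame_budget = 8   # hypothetical number of frames each pathway may read
    print("dense :", dense_window_indices(clip_length, frame_budget))
    print("sparse:", uniform_sparse_indices(clip_length, frame_budget))
```

With the same frame budget, the dense window covers only the final eight frames of the clip, while the uniform sampler spans the whole clip at a coarse stride. This is the trade-off between fine-grained local motion and long-range context that the abstract describes.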

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Human Pose and Action Recognition