VLA-JEPA: Enhancing Vision-Language-Action Model with Latent World Model

Jingwen Sun; Wenyao Zhang; Zekun Qi; Shaojie Ren; Zezhi Liu; Hanxin Zhu; Guangzhong Sun; Xin Jin; Zhibo Chen

arXiv:2602.10098·cs.RO·February 17, 2026

TL;DR

VLA-JEPA is a pretraining framework for vision-language-action models that predicts future states in latent space rather than pixel space, making policies pretrained on video less vulnerable to appearance bias and nuisance motion.

Contribution

The paper proposes leakage-free latent state prediction within a JEPA-style framework: future frames serve only as supervision targets, never as inputs. This simplifies training and improves robustness over prior latent-action approaches.

Findings

01 Achieves consistent performance gains on LIBERO and real-world manipulation tasks.

02 Learns dynamics abstractions robust to camera motion and background changes.

03 Reduces training to a simple two-stage recipe, avoiding the more complex pipelines of prior latent-action approaches.

Abstract

Pretraining Vision-Language-Action (VLA) policies on internet-scale video is appealing, yet current latent-action objectives often learn the wrong thing: they remain anchored to pixel variation rather than action-relevant state transitions, making them vulnerable to appearance bias, nuisance motion, and information leakage. We introduce VLA-JEPA, a JEPA-style pretraining framework that sidesteps these pitfalls by design. The key idea is leakage-free state prediction: a target encoder produces latent representations from future frames, while the student pathway sees only the current observation -- future information is used solely as supervision targets, never as input. By predicting in latent space rather than pixel space, VLA-JEPA learns dynamics abstractions that are robust to camera motion and irrelevant background changes. This yields a simple two-stage recipe -- JEPA pretraining…
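The leakage-free prediction idea described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the linear encoders, the dimensions, the mean-squared-error loss, the EMA target update, and every function name below are assumptions chosen for illustration. The only property taken from the abstract is that the student pathway receives the current observation alone, while future frames pass through the target encoder and are used purely as supervision.

```python
import numpy as np

OBS_DIM, LATENT_DIM = 16, 4  # hypothetical sizes for flattened frames and latents


def init_linear(rng, in_dim, out_dim):
    """Small random weight matrix standing in for a real network."""
    return rng.normal(scale=0.1, size=(in_dim, out_dim))


def encode(weights, frames):
    """Toy encoder: one linear layer followed by tanh."""
    return np.tanh(frames @ weights)


def predict_future_latent(student_w, predictor_w, current):
    """Student pathway. It takes the current observation only, so future
    frames cannot leak into the prediction."""
    return encode(student_w, current) @ predictor_w


def latent_prediction_loss(student_w, predictor_w, target_w, current, future):
    """Mean squared error between predicted and target latents (assumed loss)."""
    z_pred = predict_future_latent(student_w, predictor_w, current)
    # Target pathway: future frames are encoded purely as supervision.
    # In a real framework this output would be wrapped in a stop-gradient.
    z_target = encode(target_w, future)
    return float(np.mean((z_pred - z_target) ** 2))


def ema_update(target_w, student_w, decay=0.99):
    """Assumed target update: exponential moving average of student weights."""
    return decay * target_w + (1.0 - decay) * student_w


rng = np.random.default_rng(0)
student_w = init_linear(rng, OBS_DIM, LATENT_DIM)
predictor_w = init_linear(rng, LATENT_DIM, LATENT_DIM)
target_w = student_w.copy()

current = rng.normal(size=(8, OBS_DIM))  # batch of current observations
future = rng.normal(size=(8, OBS_DIM))   # batch of future frames

loss = latent_prediction_loss(student_w, predictor_w, target_w, current, future)
target_w = ema_update(target_w, student_w)
```

Because `predict_future_latent` has no `future` argument, the separation between input and supervision is enforced by the function signature rather than by convention. The loss compares latents, not pixels, which is where the claimed robustness to appearance changes would come from in a full-scale model.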

Code & Models

Models

ginwind/VLA-JEPA (Hugging Face model)

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Human Pose and Action Recognition