VideoWorld 2: Learning Transferable Knowledge from Real-world Videos

Zhongwei Ren; Yunchao Wei; Xiao Yu; Guixun Luo; Yao Zhao; Bingyi Kang; Jiashi Feng; Xiaojie Jin

arXiv:2602.10102 · cs.CV · February 11, 2026

TL;DR

VideoWorld 2 introduces a novel method for learning transferable task knowledge directly from raw real-world videos, significantly improving performance in robotic manipulation tasks and enabling coherent long-horizon reasoning.

Contribution

It proposes a dynamic-enhanced Latent Dynamics Model that decouples action dynamics from visual appearance using a pretrained diffusion model, advancing video-based transfer learning.

Findings

1. Achieves up to 70% improvement in task success rate
2. Produces coherent long execution videos
3. Enhances robotic manipulation performance

Abstract

Learning transferable knowledge from unlabeled video data and applying it in new environments is a fundamental capability of intelligent agents. This work presents VideoWorld 2, which extends VideoWorld and offers the first investigation into learning transferable knowledge directly from raw real-world videos. At its core, VideoWorld 2 introduces a dynamic-enhanced Latent Dynamics Model (dLDM) that decouples action dynamics from visual appearance: a pretrained video diffusion model handles visual appearance modeling, enabling the dLDM to learn latent codes that focus on compact and meaningful task-related dynamics. These latent codes are then modeled autoregressively to learn task policies and support long-horizon reasoning. We evaluate VideoWorld 2 on challenging real-world handcraft making tasks, where prior video generation and latent-dynamics models struggle to operate reliably.…

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning