Olaf-World: Orienting Latent Actions for Video World Modeling

Yuxin Jiang, Yuchao Gu, Ivor W. Tsang, and Mike Zheng Shou

arXiv:2602.10104 · cs.CV · February 11, 2026

TL;DR

Olaf-World pretrains action-conditioned video world models from unlabeled video, using a sequence-level alignment objective to structure the latent action space. This improves zero-shot action transfer and makes adaptation to new control interfaces more data-efficient.

Contribution

The paper proposes SeqΔ-REPA, a sequence-level control-effect alignment objective, and Olaf-World, a pipeline for pretraining action-conditioned video world models from passive videos. Together they address the failure of learned latent actions to transfer across contexts.

Findings

01. Enhanced zero-shot action transfer capabilities.

02. More data-efficient adaptation to new control interfaces.

03. Structured latent action space learned from passive videos.

Abstract

Scaling action-controllable world models is limited by the scarcity of action labels. While latent action learning promises to extract control interfaces from unlabeled video, learned latents often fail to transfer across contexts: they entangle scene-specific cues and lack a shared coordinate system. This occurs because standard objectives operate only within each clip, providing no mechanism to align action semantics across contexts. Our key insight is that although actions are unobserved, their semantic effects are observable and can serve as a shared reference. We introduce SeqΔ-REPA, a sequence-level control-effect alignment objective that anchors integrated latent action to temporal feature differences from a frozen, self-supervised video encoder. Building on this, we present Olaf-World, a pipeline that pretrains action-conditioned video world models from large-scale…
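
The abstract does not give the exact form of the alignment objective, so the following is only a minimal sketch of the general idea it describes: pull an integrated latent action toward the temporal difference of frozen encoder features. Everything specific here is an assumption for illustration, not the paper's formulation: integration is taken to be a plain sum over steps, the feature difference is last frame minus first frame, the loss is one minus cosine similarity, and the latent and feature dimensions are assumed equal (a real model would likely need a learned projection between them). All function names are hypothetical.

```python
import math


def integrate_actions(latent_actions):
    """Sum per-step latent action vectors over a clip (assumed form of 'integration')."""
    dim = len(latent_actions[0])
    return [sum(step[i] for step in latent_actions) for i in range(dim)]


def feature_delta(frozen_features):
    """Temporal feature difference: last-frame feature minus first-frame feature."""
    first, last = frozen_features[0], frozen_features[-1]
    return [b - a for a, b in zip(first, last)]


def cosine_similarity(u, v, eps=1e-8):
    """Cosine similarity with a small epsilon to avoid division by zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + eps)


def alignment_loss(latent_actions, frozen_features):
    """Assumed loss: 1 - cos(integrated latent action, frozen feature difference)."""
    integrated = integrate_actions(latent_actions)
    delta = feature_delta(frozen_features)
    return 1.0 - cosine_similarity(integrated, delta)


# Toy clip: two latent action steps that both point along the first axis.
actions = [[1.0, 0.0], [1.0, 0.0]]

# Frozen features (treated as constants) that move along the same axis.
aligned_features = [[0.0, 0.0], [0.5, 0.0], [1.0, 0.0]]
# Frozen features that move in the opposite direction.
opposed_features = [[1.0, 0.0], [0.0, 0.0]]

loss_aligned = alignment_loss(actions, aligned_features)
loss_opposed = alignment_loss(actions, opposed_features)
```

In this toy setup the loss is close to 0 when the summed latent action points the same way as the observed feature change and close to 2 when it points the opposite way. Because the target comes from a frozen encoder shared by all clips, an objective of this kind gives every clip the same reference frame, which is the property the abstract says within-clip objectives lack.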

Taxonomy

Topics: Human Pose and Action Recognition · Generative Adversarial Networks and Image Synthesis · Reinforcement Learning in Robotics