Olaf-World: Orienting Latent Actions for Video World Modeling
Yuxin Jiang, Yuchao Gu, Ivor W. Tsang, and Mike Zheng Shou

TL;DR
Olaf-World introduces a novel sequence-level alignment method that improves latent action space structure in video world models, enabling better zero-shot transfer and data efficiency in control tasks from unlabeled videos.
Contribution
The paper proposes SeqΔ-REPA, a sequence-level control-effect alignment objective, and Olaf-World, a pipeline for pretraining action-conditioned video world models from passive videos. Together they address the failure of learned latent actions to transfer across contexts.
Findings
Enhanced zero-shot action transfer capabilities.
More data-efficient adaptation to new control interfaces.
Structured latent action space learned from passive videos.
Abstract
Scaling action-controllable world models is limited by the scarcity of action labels. While latent action learning promises to extract control interfaces from unlabeled video, learned latents often fail to transfer across contexts: they entangle scene-specific cues and lack a shared coordinate system. This occurs because standard objectives operate only within each clip, providing no mechanism to align action semantics across contexts. Our key insight is that although actions are unobserved, their semantic effects are observable and can serve as a shared reference. We introduce SeqΔ-REPA, a sequence-level control-effect alignment objective that anchors integrated latent action to temporal feature differences from a frozen, self-supervised video encoder. Building on this, we present Olaf-World, a pipeline that pretrains action-conditioned video world models from large-scale…
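The alignment idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify how latent actions are integrated, how they are mapped into the encoder's feature space, or which distance is used. The sketch below assumes integration by summation over the clip, a hypothetical linear projection `proj`, and a cosine-distance loss against the difference between the last-frame and first-frame features of a frozen encoder. The function name `seq_delta_alignment_loss` and all shapes are illustrative.

```python
import numpy as np


def seq_delta_alignment_loss(latent_actions, feats, proj, eps=1e-8):
    """Cosine-distance alignment between integrated latent actions and
    the temporal feature difference of a clip (illustrative sketch).

    latent_actions: (B, T-1, d_a) per-step latent actions for each clip
    feats:          (B, T, d_f) per-frame features from a frozen encoder
    proj:           (d_a, d_f) hypothetical linear map into feature space
    """
    # Assumption: "integrated" latent action = sum of per-step latents.
    integrated = latent_actions.sum(axis=1)            # (B, d_a)
    predicted_effect = integrated @ proj               # (B, d_f)

    # Observable effect: feature change from the first to the last frame.
    target_effect = feats[:, -1] - feats[:, 0]         # (B, d_f)

    # Assumption: cosine distance as the alignment measure.
    dot = np.sum(predicted_effect * target_effect, axis=-1)
    norms = (np.linalg.norm(predicted_effect, axis=-1)
             * np.linalg.norm(target_effect, axis=-1))
    cosine = dot / (norms + eps)
    return float(np.mean(1.0 - cosine))


# Toy check: if each latent action equals the per-step feature change,
# the sum telescopes to the clip-level difference and the loss is near 0.
rng = np.random.default_rng(0)
feats = rng.normal(size=(4, 6, 8))        # 4 clips, 6 frames, 8-dim features
steps = np.diff(feats, axis=1)            # (4, 5, 8) per-step changes
loss_aligned = seq_delta_alignment_loss(steps, feats, np.eye(8))
loss_flipped = seq_delta_alignment_loss(-steps, feats, np.eye(8))
```

Because the target is a feature difference rather than a clip-specific feature, the same action performed in two different scenes is pulled toward the same direction in the frozen encoder's space, which is the "shared reference" the abstract describes. In a real training setup the encoder would stay frozen and gradients would flow only into the latent action model and the projection.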
Taxonomy
Topics: Human Pose and Action Recognition · Generative Adversarial Networks and Image Synthesis · Reinforcement Learning in Robotics
