Learning Visual Feature-Based World Models via Residual Latent Action
Xinyu Zhang, Zhengtong Xu, Yutian Tao, Yeping Wang, Yu She, Abdeslam Boularias

TL;DR
This paper introduces Residual Latent Action (RLA), a novel predictive latent representation for visual feature-based world models, enabling faster and more accurate policy learning from offline videos.
Contribution
It proposes RLA-WM, a flow-matching world model using RLA, and demonstrates its effectiveness in simulation, real-world datasets, and robot policy learning from offline videos.
Findings
RLA-WM outperforms state-of-the-art models in accuracy and speed.
RLA enables learning from actionless demonstration videos.
Visual RL trained entirely offline achieves competitive performance.
Abstract
World models predict future transitions from observations and actions. Existing work focuses predominantly on image generation. Visual feature-based world models, on the other hand, predict future visual features instead of raw video pixels, offering a promising alternative that is more efficient and less prone to hallucination. However, current feature-based approaches rely on direct regression, which leads to blurry or collapsed predictions in complex interactions, while generative modeling in high-dimensional feature spaces remains challenging. In this work, we discover that a new type of latent action representation, which we refer to as *Residual Latent Action* (RLA), can be easily learned from DINO residuals. We also show that RLA is predictive and generalizable and that it encodes temporal progression. Building on RLA, we propose *RLA World Model* (RLA-WM), which predicts RLA…
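
To make two terms from the abstract concrete, here is a minimal sketch of a feature residual between consecutive frames and of a training sample for flow matching. It is not the paper's implementation. Random arrays stand in for DINO patch features, the helper names `feature_residual` and `flow_matching_pair` are hypothetical, and the linear interpolation path is one common flow-matching choice that the paper may or may not use. The step that turns residuals into a latent action is omitted because the abstract does not describe it.

```python
import numpy as np

def feature_residual(feat_t, feat_next):
    """Difference between features of consecutive frames (hypothetical helper)."""
    return feat_next - feat_t

def flow_matching_pair(x1, rng):
    """Build one flow-matching training sample for a target batch x1.

    Assumes the linear path x_t = (1 - t) * x0 + t * x1 with x0 ~ N(0, I);
    the regression target for the velocity network is then x1 - x0.
    """
    x0 = rng.standard_normal(x1.shape)
    # One time value per batch element, shaped to broadcast over the rest.
    t = rng.uniform(0.0, 1.0, size=(x1.shape[0],) + (1,) * (x1.ndim - 1))
    x_t = (1.0 - t) * x0 + t * x1
    v_target = x1 - x0
    return x_t, t, v_target

rng = np.random.default_rng(0)

# Stand-in for patch features of two consecutive frames: (batch, patches, dim).
feat_t = rng.standard_normal((4, 16, 32))
feat_next = feat_t + 0.1 * rng.standard_normal((4, 16, 32))

residual = feature_residual(feat_t, feat_next)
x_t, t, v_target = flow_matching_pair(residual, rng)
```

A velocity network would be trained to map `(x_t, t)` plus any conditioning to `v_target`. Sampling then integrates the learned velocity from noise at `t = 0` to a prediction at `t = 1`, which is what lets a generative model avoid the averaged, blurry outputs that the abstract attributes to direct regression.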
