Learning Visual Feature-Based World Models via Residual Latent Action

Xinyu Zhang; Zhengtong Xu; Yutian Tao; Yeping Wang; Yu She; Abdeslam Boularias

arXiv:2605.07079·cs.CV·May 11, 2026

2 Repos · 1 Model · 1 Dataset

TL;DR

This paper introduces Residual Latent Action (RLA), a novel predictive latent representation for visual feature-based world models, enabling faster and more accurate policy learning from offline videos.

Contribution

The paper proposes RLA-WM, a flow-matching world model built on RLA, and demonstrates its effectiveness on simulation and real-world datasets and in robot policy learning from offline videos.
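
For readers unfamiliar with flow matching, the following is a minimal sketch of how one training example is typically built, assuming the common straight-line interpolation between a noise sample and a data sample. The paper's actual network, conditioning, and loss details are not given on this page; the names `flow_matching_pair` and `mse` are hypothetical and used only for illustration.

```python
import random


def flow_matching_pair(x1, rng):
    """Build one training example for linear-path flow matching.

    x1 is a data sample (list of floats). Returns (x_t, t, velocity), where
    x_t lies on the straight line from a noise sample x0 to x1 at time t.
    """
    x0 = [rng.gauss(0.0, 1.0) for _ in x1]            # noise sample
    t = rng.random()                                   # time in [0, 1)
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    velocity = [b - a for a, b in zip(x0, x1)]         # constant along the path
    return x_t, t, velocity


def mse(pred, target):
    """Mean squared error between a predicted and a target velocity."""
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(target)


rng = random.Random(0)
x1 = [0.5, -1.0, 2.0]          # stand-in for a target latent (assumption)
x_t, t, v = flow_matching_pair(x1, rng)
# A model v_theta(x_t, t) would be trained to minimize mse(v_theta(x_t, t), v).
```

Under this construction, moving from `x_t` along `v` for the remaining time `1 - t` lands exactly on `x1`, which is what lets a trained velocity model turn noise into samples by integration.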

Findings

01. RLA-WM outperforms state-of-the-art models in accuracy and speed.

02. RLA enables learning from actionless demonstration videos.

03. Visual RL trained entirely offline achieves competitive performance.

Abstract

World models predict future transitions from observations and actions. Existing work predominantly focuses on image generation. Visual feature-based world models, on the other hand, predict future visual features instead of raw video pixels, offering a promising alternative that is more efficient and less prone to hallucination. However, current feature-based approaches rely on direct regression, which leads to blurry or collapsed predictions in complex interactions, while generative modeling in high-dimensional feature spaces remains challenging. In this work, we discover that a new type of latent action representation, which we refer to as *Residual Latent Action* (RLA), can be easily learned from DINO residuals. We also show that RLA is predictive, generalizable, and encodes temporal progression. Building on RLA, we propose *RLA World Model* (RLA-WM), which predicts RLA…
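
To make "DINO residuals" concrete, here is a small sketch under one assumption: that a residual is the element-wise difference between the patch features of two consecutive frames. The abstract does not define the residual or show how it is encoded into an RLA, so the functions `feature_residual` and `apply_residual` and the toy feature values are hypothetical stand-ins, not the authors' implementation.

```python
def feature_residual(feat_t, feat_next):
    """Per-patch, per-channel difference between two frames' feature maps.

    Each argument is a list of patches, each patch a list of channel values.
    """
    return [[b - a for a, b in zip(pa, pb)] for pa, pb in zip(feat_t, feat_next)]


def apply_residual(feat_t, residual):
    """Reconstruct next-frame features by adding a residual to the current ones."""
    return [[a + r for a, r in zip(pa, pr)] for pa, pr in zip(feat_t, residual)]


# Toy 2-patch, 2-channel features standing in for DINO outputs (assumption).
feat_t = [[0.25, 1.0], [0.5, -0.5]]
feat_next = [[0.5, 1.0], [0.0, -0.5]]

res = feature_residual(feat_t, feat_next)
```

In this toy case the residual is zero wherever the features did not change, which illustrates why a residual can be a more compact prediction target than the full next-frame feature map.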

Code & Models

Repositories

Models

🤗 xyzhang368/RLA-WM (model)

Datasets

xyzhang368/RLA-WM (dataset · 103 dl)
