DiLA: Disentangled Latent Action World Models

Tianqiu Zhang; Muyang Lyu; Yufan Zhang; Fang Fang; Si Wu

arXiv:2605.15725·cs.CV·May 18, 2026

TL;DR

DiLA introduces a disentangled latent action world model that balances action abstraction with high-quality video generation, advancing self-supervised world modeling.

Contribution

The paper presents DiLA, a novel framework that co-evolves disentanglement and latent action learning to improve video generation and action interpretability.

Findings

01. DiLA achieves superior video generation quality.

02. It enables effective action transfer and visual planning.

03. The model offers enhanced manifold interpretability.

Abstract

Latent Action Models (LAMs) enable the learning of world models from unlabeled video by inferring abstract actions between consecutive frames. However, LAMs face a fundamental trade-off between action abstraction and generation fidelity. Existing methods typically circumvent this issue by using two-stage training with pre-trained world models or by limiting predictions to optical flow. In this paper, we introduce DiLA, a novel Disentangled Latent Action world model that aims to resolve this trade-off via content-structure disentanglement. Our key insight is that disentanglement and latent action learning are co-evolving: the predictive bottleneck inherent in latent action learning serves as a driving force for disentanglement, compelling the model to distill spatial layouts into the structure pathway while offloading visual details to a separate content pathway for generation. This…
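The two-pathway idea described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the layer sizes, the random linear maps standing in for learned networks, and every function name below are assumptions made only to show the data flow. A narrow latent action is inferred from the structure codes of two consecutive frames, and the next frame is decoded from the predicted structure code together with the current frame's content code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration (not from the paper).
FRAME_DIM = 64    # flattened frame features
STRUCT_DIM = 16   # structure code (spatial layout)
CONTENT_DIM = 32  # content code (visual detail)
ACTION_DIM = 2    # narrow latent action bottleneck


def linear(in_dim, out_dim):
    """Random weight matrix standing in for a learned layer."""
    return rng.normal(scale=in_dim ** -0.5, size=(in_dim, out_dim))


W_struct = linear(FRAME_DIM, STRUCT_DIM)             # structure pathway
W_content = linear(FRAME_DIM, CONTENT_DIM)           # content pathway
W_action = linear(2 * STRUCT_DIM, ACTION_DIM)        # latent action encoder
W_dyn = linear(STRUCT_DIM + ACTION_DIM, STRUCT_DIM)  # structure dynamics
W_dec = linear(STRUCT_DIM + CONTENT_DIM, FRAME_DIM)  # frame decoder


def infer_latent_action(frame_t, frame_next):
    """Infer a low-dimensional action from two consecutive structure codes."""
    s_t = frame_t @ W_struct
    s_next = frame_next @ W_struct
    return np.tanh(np.concatenate([s_t, s_next], axis=-1) @ W_action)


def predict_next_frame(frame_t, action):
    """Predict the next frame from current structure, content, and an action."""
    s_t = frame_t @ W_struct
    c_t = frame_t @ W_content  # visual detail bypasses the action bottleneck
    s_pred = np.concatenate([s_t, action], axis=-1) @ W_dyn
    return np.concatenate([s_pred, c_t], axis=-1) @ W_dec


# A batch of 4 (frame_t, frame_next) pairs with random stand-in features.
frames = rng.normal(size=(4, 2, FRAME_DIM))
actions = infer_latent_action(frames[:, 0], frames[:, 1])
pred = predict_next_frame(frames[:, 0], actions)

# Reconstruction error that training would minimize in this toy setup.
loss = float(np.mean((pred - frames[:, 1]) ** 2))
```

In this sketch the content code reaches the decoder directly, so the small action vector only has to account for the change in structure between frames. Calling `predict_next_frame` on a frame from a different clip with the same `actions` would be the toy analogue of action transfer.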

Code & Models

Models

senngadaisuki/disentangled-latent-action-world-models (Hugging Face model)
