Factored Latent Action World Models
Zizhao Wang, Chang Shi, Jiaheng Hu, Kevin Rohling, Roberto Martín-Martín, Amy Zhang, Peter Stone

TL;DR
The paper introduces FLAM, a factored latent action model that decomposes a scene into independent factors, enabling more accurate modeling of multi-entity dynamics and better video generation when learning from action-free videos.
Contribution
The paper proposes a factored dynamics framework that infers a separate latent action for each scene factor, improving modeling accuracy in complex multi-entity environments.
Findings
FLAM outperforms prior models in prediction accuracy.
FLAM improves representation quality.
FLAM facilitates downstream policy learning.
Abstract
Learning latent actions from action-free video has emerged as a powerful paradigm for scaling up controllable world model learning. Latent actions provide a natural interface for users to iteratively generate and manipulate videos. However, most existing approaches rely on monolithic inverse and forward dynamics models that learn a single latent action to control the entire scene, and therefore struggle in complex environments where multiple entities act simultaneously. This paper introduces Factored Latent Action Model (FLAM), a factored dynamics framework that decomposes the scene into independent factors, each inferring its own latent action and predicting its own next-step factor value. This factorized structure enables more accurate modeling of complex multi-entity dynamics and improves video generation quality in action-free video settings compared to monolithic models. Based on…
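The factored structure described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the linear maps, the tanh bottleneck, and the choice to share weights across factors are all assumptions made for illustration. The only properties taken from the abstract are that each factor infers its own latent action from its own transition and predicts its own next-step value.

```python
import numpy as np

def init_params(factor_dim, action_dim, seed=0):
    """Random linear weights (hypothetical; shared by all factors)."""
    rng = np.random.default_rng(seed)
    return {
        # inverse dynamics: (z_t^k, z_{t+1}^k) -> latent action a^k
        "W_inv": rng.normal(scale=0.1, size=(2 * factor_dim, action_dim)),
        # forward dynamics: (z_t^k, a^k) -> change in factor k
        "W_fwd": rng.normal(scale=0.1, size=(factor_dim + action_dim, factor_dim)),
    }

def infer_latent_actions(params, z_t, z_next):
    """Per-factor inverse dynamics.

    z_t, z_next: arrays of shape (K, D), one row per scene factor.
    Returns an array of shape (K, A): one latent action per factor.
    """
    pair = np.concatenate([z_t, z_next], axis=-1)  # (K, 2D)
    return np.tanh(pair @ params["W_inv"])         # (K, A)

def predict_next_factors(params, z_t, actions):
    """Per-factor forward dynamics: row k depends only on z_t[k] and actions[k]."""
    inp = np.concatenate([z_t, actions], axis=-1)  # (K, D + A)
    return z_t + inp @ params["W_fwd"]             # (K, D)

# Usage: 3 factors, 4-dim factor values, 2-dim latent actions.
params = init_params(factor_dim=4, action_dim=2)
rng = np.random.default_rng(1)
z_t, z_next = rng.normal(size=(3, 4)), rng.normal(size=(3, 4))
actions = infer_latent_actions(params, z_t, z_next)
pred = predict_next_factors(params, z_t, actions)
```

Because every row is processed independently, editing the latent action of one factor changes only that factor's prediction. A monolithic model would instead map the whole scene to a single action vector, so it has no equivalent way to control one entity in isolation.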
Taxonomy
Topics: Human Pose and Action Recognition · Generative Adversarial Networks and Image Synthesis · Human Motion and Animation
