Co-Evolving Latent Action World Models
Yucen Wang, Fengming Zhang, De-Chuan Zhan, Li Zhao, Kaixin Wang, and Jiang Bian

TL;DR
CoLA-World jointly trains a latent action model with a pretrained world model so that the two co-evolve, yielding video simulation quality that matches or exceeds prior methods and better downstream visual planning.
Contribution
The paper proposes a method for jointly training a latent action model with a pretrained world model, using a warm-up phase to overcome representational collapse and improve control and simulation quality.
Findings
Matches or outperforms prior methods in video simulation quality.
Improves downstream visual planning performance.
A warm-up phase that aligns the from-scratch latent action model with the pretrained world model lets the two models co-evolve without representational collapse.
Abstract
Adapting pretrained video generation models into controllable world models via latent actions is a promising step towards creating generalist world models. The dominant paradigm adopts a two-stage approach that trains the latent action model (LAM) and the world model separately, resulting in redundant training and limiting their potential for co-adaptation. A conceptually simple and appealing idea is to directly replace the forward dynamics model in the LAM with a powerful world model and train them jointly, but this is non-trivial and prone to representational collapse. In this work, we propose CoLA-World, which for the first time successfully realizes this synergistic paradigm, resolving the core challenge in joint learning through a critical warm-up phase that effectively aligns the representations of the from-scratch LAM with the pretrained world model. This unlocks a co-evolution cycle:…
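The warm-up-then-joint schedule described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the names `PhaseConfig`, `phase_for_step`, and `warmup_steps` are hypothetical, and the sketch assumes (the abstract does not say this) that the pretrained world model is held frozen during warm-up while only the from-scratch LAM is updated.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PhaseConfig:
    """Which modules receive gradient updates in a phase (hypothetical)."""
    name: str
    train_lam: bool
    train_world_model: bool


def phase_for_step(step: int, warmup_steps: int) -> PhaseConfig:
    """Return the training phase for a global step.

    Assumption (not stated in the abstract): during warm-up only the
    from-scratch LAM is updated and the pretrained world model is frozen;
    afterwards both modules are trained jointly.
    """
    if step < 0 or warmup_steps < 0:
        raise ValueError("step and warmup_steps must be non-negative")
    if step < warmup_steps:
        return PhaseConfig("warmup", train_lam=True, train_world_model=False)
    return PhaseConfig("joint", train_lam=True, train_world_model=True)
```

A training loop could call `phase_for_step(step, warmup_steps)` on each iteration and enable gradients only for the modules the returned phase marks as trainable. How long the warm-up lasts, and what is actually frozen, would need to be taken from the paper itself.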
