From Imagined Futures to Executable Actions: Mixture of Latent Actions for Robot Manipulation
Yajie Li, Bozhou Zhang, Chun Gu, Zipei Ma, Jiahui Zhang, Jiankang Deng, Xiatian Zhu, Li Zhang

TL;DR
This paper introduces MoLA, a control-oriented framework that converts imagined future videos into executable actions for robot manipulation by using a mixture of inverse dynamics models.
Contribution
MoLA leverages a mixture of pretrained inverse dynamics models to transform visual predictions into structured, action-centric representations, improving control stability and generalization.
Findings
Achieves higher task success rates on simulated benchmarks.
Improves temporal consistency in robot actions.
Demonstrates effective real-world robot manipulation.
Abstract
Video generation models offer a promising imagination mechanism for robot manipulation by predicting long-horizon future observations, but effectively exploiting these imagined futures for action execution remains challenging. Existing approaches either condition policies on predicted frames or directly decode generated videos into actions, both suffering from a mismatch between visual realism and control relevance. As a result, predicted observations emphasize perceptual fidelity rather than action-centric causes of state transitions, leading to indirect and unstable control. To address this gap, we propose MoLA (Mixture of Latent Actions), a control-oriented interface that transforms imagined future videos into executable representations. Instead of passing predicted frames directly to the policy, MoLA leverages a mixture of pretrained inverse dynamics models to infer a mixture of…
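As a rough illustration of the interface described in the abstract, the sketch below runs several inverse dynamics models over consecutive frames of an imagined rollout and collects one latent action per model per transition. Everything in it is an assumption made for illustration, not the authors' implementation: the names `make_linear_idm` and `mixture_of_latent_actions` are hypothetical, the "pretrained" models are random linear maps, and frames are stand-in feature vectors rather than images. How MoLA combines or consumes these latents is not stated in the text above, so the sketch only stacks them.

```python
import numpy as np


def make_linear_idm(feat_dim, latent_dim, seed):
    """Hypothetical stand-in for one pretrained inverse dynamics model.

    A real model would be a trained network; here it is a fixed random
    linear map from a (frame_t, frame_next) pair to a latent action.
    """
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((2 * feat_dim, latent_dim)) / np.sqrt(2 * feat_dim)

    def idm(frame_t, frame_next):
        # Inverse dynamics: infer the action that explains the transition.
        return np.concatenate([frame_t, frame_next]) @ W

    return idm


def mixture_of_latent_actions(frames, idms):
    """Apply every inverse dynamics model to every consecutive frame pair.

    frames: array of shape (T, feat_dim), the imagined future rollout.
    Returns an array of shape (T - 1, len(idms), latent_dim).
    """
    return np.stack([
        np.stack([idm(frames[t], frames[t + 1]) for idm in idms])
        for t in range(len(frames) - 1)
    ])


# Toy imagined rollout: 5 frames, each an 8-dimensional feature vector.
imagined = np.random.default_rng(0).standard_normal((5, 8))
# Three hypothetical inverse dynamics models with 4-dimensional latents.
idms = [make_linear_idm(feat_dim=8, latent_dim=4, seed=s) for s in range(3)]

latents = mixture_of_latent_actions(imagined, idms)
print(latents.shape)  # (4, 3, 4): transitions x models x latent dimension
```

In this toy setup a downstream policy would receive `latents` in place of the raw predicted frames, which is the substitution the abstract argues for.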
