Learning Additively Compositional Latent Actions for Embodied AI
Hangxing Wei, Xiaoyu Chen, Chuheng Zhang, Tim Pearce, Jianyu Chen, Alex Lamb, Li Zhao, Jiang Bian

TL;DR
This paper introduces AC-LAM, a model that enforces additive compositional structure in latent actions, improving motion representation and policy learning in embodied AI from visual data.
Contribution
AC-LAM is the first to impose additive compositional priors on latent actions, leading to more structured and calibrated motion representations for embodied AI.
Findings
AC-LAM outperforms state-of-the-art LAMs in simulated and real-world tabletop tasks.
AC-LAM learns more structured, motion-specific, and displacement-calibrated latent actions.
Enforcing additive structure improves downstream policy learning.
Abstract
Latent action learning infers pseudo-action labels from visual transitions, providing an approach to leverage internet-scale video for embodied AI. However, most methods learn latent actions without structural priors that encode the additive, compositional structure of physical motion. As a result, latents often entangle irrelevant scene details or information about future observations with true state changes and miscalibrate motion magnitude. We introduce the Additively Compositional Latent Action Model (AC-LAM), which enforces scene-wise additive composition structure over short horizons on the latent action space. These AC constraints encourage simple algebraic structure in the latent action space (identity, inverse, cycle consistency) and suppress information that does not compose additively. Empirically, AC-LAM learns more structured, motion-specific, and displacement-calibrated latent…
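The algebraic properties named in the abstract (identity, inverse, and additive composition over a short horizon) can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not give the loss, so the code assumes the simplest reading, namely that the latent for a two-step transition should equal the sum of the two one-step latents, that the latent for a reversed transition should be the negation, and that the latent for a frame paired with itself should be zero. The names `linear_encoder`, `entangled_encoder`, and `ac_residuals` are hypothetical, and both encoders are toy stand-ins rather than learned models.

```python
import math
import random


def linear_encoder(W):
    """Hypothetical encoder: a linear map applied to the frame difference."""
    def encode(obs_a, obs_b):
        diff = [b - a for a, b in zip(obs_a, obs_b)]
        return [sum(w * d for w, d in zip(row, diff)) for row in W]
    return encode


def entangled_encoder(W):
    """Hypothetical encoder that also leaks absolute content of the next frame."""
    lin = linear_encoder(W)

    def encode(obs_a, obs_b):
        motion = lin(obs_a, obs_b)
        # Scene term depends on obs_b alone, not on the change between frames.
        scene = lin([0.0] * len(obs_b), obs_b)
        return [math.tanh(m) + 0.1 * s for m, s in zip(motion, scene)]
    return encode


def mean_sq(vec):
    """Mean squared value of a vector."""
    return sum(v * v for v in vec) / len(vec)


def ac_residuals(encode, o0, o1, o2):
    """Assumed residuals: each is zero for an additively compositional latent."""
    z01, z12, z02 = encode(o0, o1), encode(o1, o2), encode(o0, o2)
    z10, z00 = encode(o1, o0), encode(o0, o0)
    return {
        # z(o0 -> o1) + z(o1 -> o2) should match z(o0 -> o2)
        "composition": mean_sq([a + b - c for a, b, c in zip(z01, z12, z02)]),
        # z(o0 -> o1) + z(o1 -> o0) should vanish
        "inverse": mean_sq([a + b for a, b in zip(z01, z10)]),
        # z(o0 -> o0) should vanish
        "identity": mean_sq(z00),
    }


# Toy data: three consecutive "observations" as random vectors.
rng = random.Random(0)
obs_dim, latent_dim = 8, 3
W = [[rng.uniform(-1, 1) for _ in range(obs_dim)] for _ in range(latent_dim)]
o0, o1, o2 = ([rng.uniform(-1, 1) for _ in range(obs_dim)] for _ in range(3))

additive = ac_residuals(linear_encoder(W), o0, o1, o2)
entangled = ac_residuals(entangled_encoder(W), o0, o1, o2)
```

Under these assumptions, the encoder that depends only on the frame difference satisfies all three properties up to floating-point error, while the encoder that leaks scene content does not. A training objective could plausibly add weighted versions of such residuals as regularizers, though how AC-LAM actually weights or implements its constraints is not stated in the abstract.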
