HarmoWAM: Harmonizing Generalizable and Precise Manipulation via Adaptive World Action Models
Qiuxuan Feng, Jiale Yu, Jiaming Liu, Yueru Jia, Zhuangzhe Wu, Hao Chen, Zezhong Qian, Shuo Gu, Peng Jia, Siwei Ma, Shanghang Zhang

TL;DR
HarmoWAM introduces an adaptive World Action Model (WAM) that uses a world model to unify predictive and reactive control, targeting both generalizable transit and precise manipulation in unseen environments.
Contribution
The paper proposes HarmoWAM, a novel end-to-end world action model that adaptively combines predictive and reactive experts for improved robotic manipulation.
Findings
HarmoWAM achieves 33% and 29% improvements over prior models in zero-shot generalization.
It effectively handles variations in background, position, and object semantics in unseen environments.
The adaptive gating mechanism enables dynamic switching between control strategies.
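To make the idea of gated switching concrete, here is a minimal sketch of a soft gate that blends the outputs of two experts. It is an illustration only, not the paper's mechanism: this summary does not say what the gate takes as input or how it is trained. The proximity cue (`distance_to_object`), the `threshold` and `sharpness` parameters, and both function names are hypothetical.

```python
import math
from typing import List, Sequence


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def gate_weight(distance_to_object: float,
                threshold: float = 0.05,
                sharpness: float = 100.0) -> float:
    """Hypothetical gate: weight in (0, 1) assigned to the reactive expert.

    ASSUMPTION: the gate is driven by gripper-to-object distance (meters).
    The weight approaches 1 when the gripper is closer than `threshold`
    and approaches 0 during long-range transit.
    """
    return sigmoid(sharpness * (threshold - distance_to_object))


def blend_actions(predictive_action: Sequence[float],
                  reactive_action: Sequence[float],
                  w: float) -> List[float]:
    """Convex combination of two expert actions; `w` weights the reactive one."""
    if len(predictive_action) != len(reactive_action):
        raise ValueError("expert actions must have the same dimension")
    return [(1.0 - w) * p + w * r
            for p, r in zip(predictive_action, reactive_action)]


# Far from the object the predictive expert dominates; close to it,
# the reactive expert takes over.
far = blend_actions([1.0, 0.0], [0.0, 1.0], gate_weight(0.50))
near = blend_actions([1.0, 0.0], [0.0, 1.0], gate_weight(0.00))
```

A soft (sigmoid) gate is used here rather than a hard switch so that the commanded action changes smoothly across the handover; a learned gate could replace the hand-written rule without changing `blend_actions`.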
Abstract
World Action Models (WAMs) have emerged as a promising paradigm for robot control by modeling physical dynamics. Current WAMs generally follow two paradigms: the "Imagine-then-Execute" approach, which uses video prediction to infer actions via inverse dynamics, and the "Joint Modeling" approach, which jointly models actions and video representations. Based on systematic experiments, we observe a fundamental trade-off between these paradigms: the former explicitly leverages world models for generalizable transit but lacks interaction precision, whereas the latter enables fine-grained, temporally coherent action generation but is constrained by the exploration space of the training distribution. Motivated by these findings, we propose HarmoWAM, an end-to-end WAM that fully leverages a world model to unify predictive and reactive control, enabling both generalizable transit and precise…
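As a rough illustration of the "Imagine-then-Execute" paradigm described in the abstract, the toy loop below first predicts a future state and then recovers the action with an inverse-dynamics step. Everything here is an assumption made for illustration: low-dimensional state vectors stand in for video frames, the predictor simply moves a fixed fraction toward a goal, and the function names are hypothetical rather than taken from the paper.

```python
from typing import List, Tuple

State = List[float]


def imagine_next_state(state: State, goal: State, step: float = 0.25) -> State:
    """Toy stand-in for a video predictor: move a fraction `step` toward the goal."""
    return [s + step * (g - s) for s, g in zip(state, goal)]


def inverse_dynamics(state: State, next_state: State) -> State:
    """Toy inverse dynamics: the action is the displacement explaining the transition."""
    return [n - s for s, n in zip(state, next_state)]


def imagine_then_execute(state: State, goal: State,
                         steps: int = 3) -> Tuple[State, List[State]]:
    """Alternate between imagining the next state and executing the inferred action."""
    actions: List[State] = []
    for _ in range(steps):
        predicted = imagine_next_state(state, goal)   # "imagine"
        action = inverse_dynamics(state, predicted)   # infer the action
        actions.append(action)
        state = [s + a for s, a in zip(state, action)]  # "execute"
    return state, actions


final_state, executed = imagine_then_execute([0.0], [1.0], steps=3)
```

In this sketch the action quality is bounded by the imagined state, which mirrors the abstract's point that this paradigm depends on the world model's predictions; a "Joint Modeling" approach would instead produce actions and predictions from one model rather than in two stages.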
