OA-WAM: Object-Addressable World Action Model for Robust Robot Manipulation
Yushan Liu, Peibo Sun, Shoujie Li, Yifan Xie, Lingfeng Zhang, Xintao Chao, Shiyuan Dong, Fang Chen, Xiao-Ping Zhang, Wenbo Ding

TL;DR
OA-WAM is an object-addressable World Action Model that represents each frame as one robot slot plus N object slots, so the action decoder can address the specific object an instruction refers to. This makes manipulation more robust under scene shifts.
Contribution
OA-WAM proposes an object-addressable, slot-based representation for World Action Models: each slot pairs a persistent address vector with a time-varying content vector, enabling object-specific reasoning and manipulation under scene shifts.
Findings
Achieves 97.8% on the LIBERO benchmark
Reaches state-of-the-art results on the LIBERO-Plus geometric axes
Shows a swap-binding cosine similarity of 0.87 under interventions
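For readers unfamiliar with the metric in the last finding, cosine similarity measures how closely two vectors point in the same direction, ranging from -1 (opposite) to 1 (identical direction). The sketch below shows only the generic formula; which vectors the paper compares in its swap-binding interventions is not stated in this summary, and the function name `cosine_similarity` is an illustrative choice.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return dot(a, b) / (||a|| * ||b||) for two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)
```

A reported value of 0.87 therefore indicates vectors that are strongly, but not perfectly, aligned.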
Abstract
World Action Models (WAMs) enhance Vision-Language-Action policies by jointly predicting scene evolution and robot actions, but existing methods usually represent the predicted world as holistic images, video tokens, or global latents. These representations are difficult for an action decoder to address when an instruction refers to a particular object, especially under scene shifts where object identity is entangled with context. We propose OA-WAM, an Object-Addressable World Action Model for robust robot manipulation. OA-WAM decomposes each frame into N+1 slot states, with one robot slot and N object slots. Each slot contains a persistent address vector and a time-varying content vector, and is fused with text, image, proprioception, and past-action tokens in a block-causal sequence. A world head predicts next-frame slot states, while a flow-matching action head decodes a 16-step…
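The slot layout described in the abstract can be pictured with a small data-structure sketch. This is an illustration under stated assumptions, not the authors' implementation: the names `SlotState`, `make_frame`, and `advance` are hypothetical, the one-hot addresses are placeholders (the summary does not say how address vectors are produced), and `advance` simply accepts next-frame contents as given, where the paper uses a learned world head to predict them.

```python
from dataclasses import dataclass, replace
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class SlotState:
    address: Tuple[float, ...]  # persistent: identifies the slot across frames
    content: Tuple[float, ...]  # time-varying: the slot's state at this frame


def make_frame(num_objects: int, content_dim: int) -> List[SlotState]:
    """Build N+1 slots: index 0 is the robot slot, 1..N are object slots."""
    num_slots = num_objects + 1
    frame = []
    for i in range(num_slots):
        # Placeholder one-hot address (an assumption made for illustration).
        address = tuple(1.0 if j == i else 0.0 for j in range(num_slots))
        content = tuple(0.0 for _ in range(content_dim))
        frame.append(SlotState(address=address, content=content))
    return frame


def advance(frame: List[SlotState],
            next_contents: Sequence[Sequence[float]]) -> List[SlotState]:
    """Return the next frame: contents are replaced, addresses are kept."""
    if len(next_contents) != len(frame):
        raise ValueError("need exactly one content vector per slot")
    return [replace(slot, content=tuple(c))
            for slot, c in zip(frame, next_contents)]
```

Keeping the address fixed while only the content changes is what lets a downstream consumer refer to the same object across frames, even when its appearance or surroundings shift.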
