OA-WAM: Object-Addressable World Action Model for Robust Robot Manipulation

Yushan Liu; Peibo Sun; Shoujie Li; Yifan Xie; Lingfeng Zhang; Xintao Chao; Shiyuan Dong; Fang Chen; Xiao-Ping Zhang; Wenbo Ding

arXiv:2605.06481 · cs.RO · May 8, 2026

TL;DR

OA-WAM is an object-addressable World Action Model for robot manipulation. It represents each frame as one robot slot plus N object slots rather than as a holistic image or global latent, so that the action decoder can address the specific object an instruction refers to, including under scene shifts.

Contribution

The paper proposes an object-addressable, slot-based representation for World Action Models. Each slot pairs a persistent address vector with a time-varying content vector, with the aim of supporting object-specific reasoning and manipulation under scene shifts.

Findings

01. Achieves 97.8% on the LIBERO benchmark

02. Reaches state of the art on the LIBERO-Plus geometric axes

03. Reaches a swap-binding cosine similarity of 0.87 in interventions

Abstract

World Action Models (WAMs) enhance Vision-Language-Action policies by jointly predicting scene evolution and robot actions, but existing methods usually represent the predicted world as holistic images, video tokens, or global latents. These representations are difficult for an action decoder to address when an instruction refers to a particular object, especially under scene shifts where object identity is entangled with context. We propose OA-WAM, an Object-Addressable World Action Model for robust robot manipulation. OA-WAM decomposes each frame into N+1 slot states, with one robot slot and N object slots. Each slot contains a persistent address vector and a time-varying content vector, and is fused with text, image, proprioception, and past-action tokens in a block-causal sequence. A world head predicts next-frame slot states, while a flow-matching action head decodes a 16-step…
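
The slot decomposition described in the abstract can be pictured with a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the names `SlotState`, `FrameState`, `advance`, and `nearest_object` are hypothetical, vectors are plain tuples of floats, and the dot-product lookup is an assumed stand-in for however the action decoder actually addresses slots. The sketch only shows the two properties the abstract states: a frame holds one robot slot plus N object slots, and a slot's address persists while its content changes over time.

```python
from dataclasses import dataclass, replace
from typing import Sequence, Tuple

Vector = Tuple[float, ...]


@dataclass(frozen=True)
class SlotState:
    """One slot: a persistent address and a time-varying content vector."""
    address: Vector  # assumed fixed across frames
    content: Vector  # assumed to change from frame to frame


@dataclass(frozen=True)
class FrameState:
    """N+1 slot states for one frame: one robot slot and N object slots."""
    robot: SlotState
    objects: Tuple[SlotState, ...]

    def slots(self) -> Tuple[SlotState, ...]:
        # Robot slot first, then the object slots in a fixed order.
        return (self.robot,) + self.objects


def advance(frame: FrameState, new_contents: Sequence[Vector]) -> FrameState:
    """Build the next frame state: replace every content, keep every address.

    `new_contents` stands in for the output of a world head (hypothetical here).
    """
    slots = frame.slots()
    if len(new_contents) != len(slots):
        raise ValueError("expected one content vector per slot")
    updated = tuple(
        replace(slot, content=tuple(content))
        for slot, content in zip(slots, new_contents)
    )
    return FrameState(robot=updated[0], objects=updated[1:])


def nearest_object(frame: FrameState, query: Vector) -> int:
    """Index of the object slot whose address best matches `query`.

    Dot-product matching is an assumption made for illustration only.
    """
    scores = [
        sum(a * q for a, q in zip(slot.address, query))
        for slot in frame.objects
    ]
    return max(range(len(scores)), key=scores.__getitem__)


if __name__ == "__main__":
    frame = FrameState(
        robot=SlotState(address=(1.0, 0.0, 0.0), content=(0.0, 0.0)),
        objects=(
            SlotState(address=(0.0, 1.0, 0.0), content=(0.5, 0.5)),
            SlotState(address=(0.0, 0.0, 1.0), content=(0.2, 0.1)),
        ),
    )
    next_frame = advance(frame, [(0.1, 0.1), (0.6, 0.4), (0.2, 0.3)])
    print(next_frame.objects[0])
    print(nearest_object(next_frame, (0.0, 0.0, 1.0)))
```

Keeping the address separate from the content means a lookup by address returns the same slot index before and after `advance`, which is the sense in which the representation is "addressable" in this sketch.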
