AIM: Intent-Aware Unified world action Modeling with Spatial Value Maps

Liaoyuan Fan; Zetian Xu; Chen Cao; Wenyao Zhang; Mingqi Yuan; Jiayu Chen

arXiv:2604.11135·cs.RO·April 14, 2026

TL;DR

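AIM is an intent-aware unified world action model built on a pretrained video generation model. Rather than decoding actions directly from predicted future frames, it predicts a spatial value map that marks where to interact, which improves robot manipulation success, particularly on long-horizon tasks.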

Contribution

The paper proposes an explicit spatial interface (a spatial value map aligned with predicted future observations) and a training framework that together improve unified world action modeling for robot control.

Findings

01 Achieves a 94.0% success rate on the RoboTwin 2.0 benchmark.

02 Outperforms prior methods, with the largest gains on long-horizon tasks.

03 Demonstrates the effectiveness of explicit spatial-intent modeling.

Abstract

Pretrained video generation models provide strong priors for robot control, but existing unified world action models still struggle to decode reliable actions without substantial robot-specific training. We attribute this limitation to a structural mismatch: while video models capture how scenes evolve, action generation requires explicit reasoning about where to interact and the underlying manipulation intent. We introduce AIM, an intent-aware unified world action model that bridges this gap via an explicit spatial interface. Instead of decoding actions directly from future visual representations, AIM predicts an aligned spatial value map that encodes task-relevant interaction structure, enabling a control-oriented abstraction of future dynamics. Built on a pretrained video generation model, AIM jointly models future observations and value maps within a shared mixture-of-transformers…
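To make the idea of a spatial value map concrete, here is a minimal sketch of how a per-pixel map could be reduced to a single interaction point. It is an illustration only: the abstract does not describe how AIM decodes actions from its value maps, and the names `argmax_pixel`, `soft_argmax`, `temperature`, and `toy_map` are hypothetical, not taken from the paper or its released model.

```python
import math


def argmax_pixel(value_map):
    """Return (row, col) of the highest-valued cell in a 2D value map."""
    best = None
    for r, row in enumerate(value_map):
        for c, v in enumerate(row):
            if best is None or v > best[0]:
                best = (v, r, c)
    return best[1], best[2]


def soft_argmax(value_map, temperature=1.0):
    """Return the softmax-weighted mean (row, col) of a 2D value map.

    Lower temperatures concentrate the weight on the peak; higher
    temperatures pull the result toward the centre of the map.
    """
    cells = [(r, c, v) for r, row in enumerate(value_map) for c, v in enumerate(row)]
    v_max = max(v for _, _, v in cells)  # subtract the max for numerical stability
    weights = [math.exp((v - v_max) / temperature) for _, _, v in cells]
    total = sum(weights)
    row = sum(w * r for w, (r, _, _) in zip(weights, cells)) / total
    col = sum(w * c for w, (_, c, _) in zip(weights, cells)) / total
    return row, col


# Hypothetical 3x3 value map: higher values mean "more worth interacting here".
toy_map = [
    [0.0, 0.1, 0.0],
    [0.1, 0.2, 0.9],
    [0.0, 0.1, 0.3],
]

peak = argmax_pixel(toy_map)                    # (1, 2)
center = soft_argmax(toy_map, temperature=0.1)  # sub-pixel estimate near the peak
```

The hard argmax gives a discrete pixel, while the soft variant gives a sub-pixel estimate that would be differentiable if written in an autodiff framework. Which of these, if either, matches AIM's design is not stated in the abstract.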

Code & Models

Model (Hugging Face): AUTMOEN999/AIM