MolmoAct: Action Reasoning Models that can Reason in Space

Jason Lee; Jiafei Duan; Haoquan Fang; Yuquan Deng; Shuo Liu; Boyang Li; Bohan Fang; Jieyu Zhang; Yi Ru Wang; Sangho Lee; Winson Han; Wilbert Pumacay; Angelica Wu; Rose Hendrix; Karen Farley; Eli VanderBilt; Ali Farhadi; Dieter Fox; Ranjay Krishna

arXiv:2508.07917·cs.RO·September 19, 2025


Open Access · 10 Models · 3 Datasets

TL;DR

MolmoAct is a robotic foundation model that integrates perception, planning, and control through a structured three-stage pipeline, enabling explainable, steerable, and generalizable robot behavior in simulation and real-world tasks.

Contribution

We introduce MolmoAct, the first action reasoning model that combines perception, planning, and control in a structured pipeline, and release a comprehensive dataset for robotic reasoning.

Findings

01 Achieves 70.5% zero-shot accuracy on SimplerEnv Visual Matching tasks

02 Surpasses existing models in success rate on LIBERO (86.6% average) and on real-world tasks

03 Outperforms baselines in out-of-distribution generalization and human preference scores

Abstract

Reasoning is central to purposeful action, yet most robotic foundation models map perception and instructions directly to control, which limits adaptability, generalization, and semantic grounding. We introduce Action Reasoning Models (ARMs), a class of robotic foundation models that integrate perception, planning, and control through a structured three-stage pipeline. Our model, MolmoAct, encodes observations and instructions into depth-aware perception tokens, generates mid-level spatial plans as editable trajectory traces, and predicts precise low-level actions, enabling explainable and steerable behavior. MolmoAct-7B-D achieves strong performance across simulation and real-world settings: 70.5% zero-shot accuracy on SimplerEnv Visual Matching tasks, surpassing closed-source Pi-0 and GR00T N1.5; 86.6% average success on LIBERO, including an additional 6.3% gain over ThinkAct on…
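The three-stage structure described in the abstract can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the names `ArmStep`, `run_pipeline`, `toy_perceive`, `toy_plan`, and `toy_act` are hypothetical, and the toy stages below stand in for learned model components. The point it shows is that the intermediate trajectory trace is an explicit value, so it can be inspected or edited before an action is predicted.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Point = Tuple[int, int]  # hypothetical 2D waypoint (x, y)


@dataclass
class ArmStep:
    """Everything produced in one pass, kept for inspection (hypothetical type)."""
    depth_tokens: List[int]   # stage 1 output: depth-aware perception tokens
    trace: List[Point]        # stage 2 output: mid-level spatial plan
    action: Tuple[int, int]   # stage 3 output: low-level action


def run_pipeline(
    perceive: Callable[[List[float], str], List[int]],
    plan: Callable[[List[int]], List[Point]],
    act: Callable[[List[int], List[Point]], Tuple[int, int]],
    observation: List[float],
    instruction: str,
    edit_trace: Optional[Callable[[List[Point]], List[Point]]] = None,
) -> ArmStep:
    tokens = perceive(observation, instruction)   # perception
    trace = plan(tokens)                          # planning
    if edit_trace is not None:
        trace = edit_trace(trace)                 # optional user steering of the plan
    action = act(tokens, trace)                   # control, conditioned on the trace
    return ArmStep(tokens, trace, action)


# Toy stand-ins (assumptions for illustration only, not learned models).
def toy_perceive(observation: List[float], instruction: str) -> List[int]:
    # Quantize depth values in [0, 1) into 4 bins.
    return [min(int(d * 4), 3) for d in observation]


def toy_plan(tokens: List[int]) -> List[Point]:
    # One waypoint per token: x is the index, y is the token value.
    return [(i, t) for i, t in enumerate(tokens)]


def toy_act(tokens: List[int], trace: List[Point]) -> Tuple[int, int]:
    # Action = displacement from the first to the last waypoint.
    (x0, y0), (x1, y1) = trace[0], trace[-1]
    return (x1 - x0, y1 - y0)


step = run_pipeline(toy_perceive, toy_plan, toy_act, [0.1, 0.5, 0.9], "pick up the cup")
steered = run_pipeline(
    toy_perceive, toy_plan, toy_act, [0.1, 0.5, 0.9], "pick up the cup",
    edit_trace=lambda tr: tr[:-1] + [(2, 0)],  # replace the final waypoint
)
```

In this sketch, editing the trace changes the resulting action without touching the perception stage, which is one plausible reading of what the abstract means by "steerable" behavior.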



Taxonomy

Topics: Multimodal Machine Learning Applications · Robot Manipulation and Learning · Reinforcement Learning in Robotics