MolmoAct: Action Reasoning Models that can Reason in Space
Jason Lee, Jiafei Duan, Haoquan Fang, Yuquan Deng, Shuo Liu, Boyang Li, Bohan Fang, Jieyu Zhang, Yi Ru Wang, Sangho Lee, Winson Han, Wilbert Pumacay, Angelica Wu, Rose Hendrix, Karen Farley, Eli VanderBilt, Ali Farhadi, Dieter Fox, Ranjay Krishna

TL;DR
MolmoAct is a novel robotic foundation model that integrates perception, planning, and control through a structured pipeline, enabling explainable, steerable, and generalizable robotic actions across simulation and real-world tasks.
Contribution
We introduce MolmoAct, the first action reasoning model that combines perception, planning, and control in a structured pipeline, and release a comprehensive dataset for robotic reasoning.
Findings
MolmoAct-7B-D achieves 70.5% zero-shot accuracy on SimplerEnv Visual Matching tasks, surpassing closed-source Pi-0 and GR00T N1.5
Surpasses existing models in success rate on LIBERO (86.6% average) and on real-world tasks
Outperforms baselines in out-of-distribution generalization and human preference scores
Abstract
Reasoning is central to purposeful action, yet most robotic foundation models map perception and instructions directly to control, which limits adaptability, generalization, and semantic grounding. We introduce Action Reasoning Models (ARMs), a class of robotic foundation models that integrate perception, planning, and control through a structured three-stage pipeline. Our model, MolmoAct, encodes observations and instructions into depth-aware perception tokens, generates mid-level spatial plans as editable trajectory traces, and predicts precise low-level actions, enabling explainable and steerable behavior. MolmoAct-7B-D achieves strong performance across simulation and real-world settings: 70.5% zero-shot accuracy on SimplerEnv Visual Matching tasks, surpassing closed-source Pi-0 and GR00T N1.5; 86.6% average success on LIBERO, including an additional 6.3% gain over ThinkAct on…
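The three-stage structure described in the abstract (depth-aware perception tokens, an editable trajectory trace, then low-level actions) can be illustrated with a toy sketch. This is not MolmoAct's implementation: the function names `depth_to_tokens`, `plan_trace`, and `trace_to_actions`, the uniform depth binning, the straight-line planner, and the 2D displacement "actions" are all assumptions chosen only to show how an explicit, editable mid-level plan sits between perception and control.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # hypothetical 2D image-plane coordinate


def depth_to_tokens(depth_map: List[List[float]],
                    num_bins: int = 8,
                    max_depth: float = 4.0) -> List[int]:
    """Stage 1 (toy): quantize each depth value into one of num_bins token IDs."""
    tokens = []
    for row in depth_map:
        for depth in row:
            clipped = min(max(depth, 0.0), max_depth)
            # Uniform binning is an assumption; the top edge maps to the last bin.
            tokens.append(min(int(clipped / max_depth * num_bins), num_bins - 1))
    return tokens


def plan_trace(start: Point, goal: Point, num_waypoints: int = 5) -> List[Point]:
    """Stage 2 (toy): a straight-line trace of waypoints (needs num_waypoints >= 2)."""
    trace = []
    for i in range(num_waypoints):
        t = i / (num_waypoints - 1)
        trace.append((start[0] + t * (goal[0] - start[0]),
                      start[1] + t * (goal[1] - start[1])))
    return trace


def trace_to_actions(trace: List[Point]) -> List[Point]:
    """Stage 3 (toy): one displacement per pair of consecutive waypoints."""
    return [(b[0] - a[0], b[1] - a[1]) for a, b in zip(trace, trace[1:])]


tokens = depth_to_tokens([[0.0, 2.0], [4.0, 10.0]])
trace = plan_trace((0.0, 0.0), (4.0, 8.0))
actions = trace_to_actions(trace)

# Steering (toy): edit one waypoint of the plan, then re-derive the actions.
edited = list(trace)
edited[2] = (2.0, 6.0)
steered_actions = trace_to_actions(edited)
```

Because the plan is an explicit list of waypoints, changing one waypoint changes only the neighbouring actions while the start and end points stay fixed, which is a rough analogue of the steerable behavior the abstract attributes to editable trajectory traces.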
Code & Models
- 🤗 allenai/MolmoAct-7B-D-Pretrain-0812 (model) · 810 downloads · ♡ 8
- 🤗 allenai/MolmoAct-7B-D-0812 (model) · 798 downloads · ♡ 53
- 🤗 allenai/MolmoAct-7B-D-Pretrain-RT-1-0812 (model) · 123 downloads · ♡ 6
- 🤗 allenai/MolmoAct-7B-O-0812 (model) · 45 downloads · ♡ 5
- 🤗 allenai/MolmoAct-7B-D-LIBERO-Long-0812 (model) · 1.7k downloads
- 🤗 allenai/MolmoAct-7B-D-LIBERO-Goal-0812 (model) · 5.4k downloads
- 🤗 allenai/MolmoAct-7B-D-LIBERO-Object-0812 (model) · 3.0k downloads
- 🤗 allenai/MolmoAct-7B-D-LIBERO-Spatial-0812 (model) · 3.1k downloads
- 🤗 allenai/MolmoAct-7B-D-Captioner-0812 (model) · 41 downloads
- 🤗 Droidcraft-OY/MolmoAct-7B-D-FP8 (model) · 11 downloads · ♡ 1
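A minimal sketch of fetching one of the checkpoints listed above, assuming the `huggingface_hub` package is installed and the repositories are publicly accessible. The repository IDs are copied from the list; the short keys in `CHECKPOINTS` and the helper `fetch_checkpoint` are hypothetical names introduced here for illustration. How to load and run a downloaded checkpoint is not covered on this page, so the sketch stops at the download.

```python
from typing import Optional

# Repository IDs as listed above; the dictionary keys are illustrative labels only.
CHECKPOINTS = {
    "pretrain": "allenai/MolmoAct-7B-D-Pretrain-0812",
    "main-d": "allenai/MolmoAct-7B-D-0812",
    "pretrain-rt1": "allenai/MolmoAct-7B-D-Pretrain-RT-1-0812",
    "main-o": "allenai/MolmoAct-7B-O-0812",
    "libero-long": "allenai/MolmoAct-7B-D-LIBERO-Long-0812",
    "libero-goal": "allenai/MolmoAct-7B-D-LIBERO-Goal-0812",
    "libero-object": "allenai/MolmoAct-7B-D-LIBERO-Object-0812",
    "libero-spatial": "allenai/MolmoAct-7B-D-LIBERO-Spatial-0812",
    "captioner": "allenai/MolmoAct-7B-D-Captioner-0812",
    "fp8-community": "Droidcraft-OY/MolmoAct-7B-D-FP8",
}


def fetch_checkpoint(name: str, local_dir: Optional[str] = None) -> str:
    """Download every file of the chosen repository and return the local path."""
    # Imported lazily so this module can be inspected without the dependency.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=CHECKPOINTS[name], local_dir=local_dir)
```

Calling `fetch_checkpoint("libero-goal")` would download the full repository into the local Hugging Face cache; a 7B-parameter checkpoint is large, so check available disk space first.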
Taxonomy
Topics: Multimodal Machine Learning Applications · Robot Manipulation and Learning · Reinforcement Learning in Robotics
