MolmoAct2: Action Reasoning Models for Real-world Deployment

Haoquan Fang; Jiafei Duan; Donovan Clay; Sam Wang; Shuo Liu; Weikai Huang; Xiang Fan; Wei-Chuan Tsai; Shirui Chen; Yi Ru Wang; Shanli Xing; Jaemin Cho; Jae Sung Park; Ainaz Eftekhar; Peter Sushko; Karen Farley; Angad Wadhwa; Cole Harrison; Winson Han; Ying-Chun Lee; Eli VanderBilt; Rose Hendrix; Suveen Ellawela; Lucas Ngoo; Joyce Chai; Zhongzheng Ren; Ali Farhadi; Dieter Fox; and Ranjay Krishna

arXiv:2605.02881 · cs.RO · May 11, 2026



2 Repos 13 Models 50 Datasets

TL;DR

MolmoAct2 is an open, practical vision-language-action model for robots that advances reasoning, datasets, and architecture, outperforming prior models in extensive real-world and simulation benchmarks.

Contribution

The paper introduces MolmoER (a VLM backbone specialized for spatial and embodied reasoning), three new datasets, an open-weight action tokenizer, a new architecture that uses flow matching to produce continuous actions, and an adaptive reasoning variant.

Findings

01 MolmoAct2 outperforms strong baselines in extensive benchmarks.

02 MolmoER surpasses GPT-5 and Gemini ER-1.5 in embodied reasoning.

03 The model and datasets are openly released for community use.

Abstract

Vision-Language-Action (VLA) models aim to provide a single generalist controller for robots, but today's systems fall short on the criteria that matter for real-world deployment. Frontier models are closed, open-weight alternatives are tied to expensive hardware, reasoning-augmented policies pay prohibitive latency for their grounding, and fine-tuned success rates remain below the threshold for dependable use. We present MolmoAct2, a fully open action reasoning model built for practical deployment, advancing its predecessor along five axes. We introduce MolmoER, a VLM backbone specialized for spatial and embodied reasoning, trained on a 3.3M-sample corpus with a specialize-then-rehearse recipe. We release three new datasets spanning low-to-medium cost platforms, including MolmoAct2-BimanualYAM, 720 hours of teleoperated bimanual trajectories that constitute the largest open bimanual…


