Mind to Hand: Purposeful Robotic Control via Embodied Reasoning

Peijun Tang; Shangjin Xie; Binyan Sun; Baifu Huang; Kuncheng Luo; Haotian Yang; Weiqi Jin; Jianan Wang

arXiv:2512.08580 · cs.RO · December 11, 2025

TL;DR

Lumo-1 is a unified vision-language-action model that advances embodied reasoning and robotic control by integrating pre-trained multimodal reasoning, structured training stages, and reinforcement learning, enabling robots to perform complex, generalist tasks with human-like understanding.

Contribution

The paper introduces Lumo-1, a vision-language-action model trained with a three-stage pipeline that combines vision-language pre-training, cross-embodiment robot data, and reinforcement learning to improve robot reasoning and its alignment with action.

Findings

01. Lumo-1 outperforms baselines on embodied reasoning tasks.

02. It generalizes strongly to new objects and environments.

03. It excels at long-horizon tasks and at tasks specified by natural language instructions.

Abstract

Humans act with context and intention, with reasoning playing a central role. While internet-scale data has enabled broad reasoning capabilities in AI systems, grounding these abilities in physical action remains a major challenge. We introduce Lumo-1, a generalist vision-language-action (VLA) model that unifies robot reasoning ("mind") with robot action ("hand"). Our approach builds upon the general multi-modal reasoning capabilities of pre-trained vision-language models (VLMs), progressively extending them to embodied reasoning and action prediction, and ultimately towards structured reasoning and reasoning-action alignment. This results in a three-stage pre-training pipeline: (1) Continued VLM pre-training on curated vision-language data to enhance embodied reasoning skills such as planning, spatial understanding, and trajectory prediction; (2) Co-training on cross-embodiment robot…

Taxonomy

Topics: Multimodal Machine Learning Applications · Action Observation and Synchronization · Robot Manipulation and Learning