Incentivizing Multimodal Reasoning in Large Models for Direct Robot Manipulation

Weiliang Tang; Dong Jing; Jia-Hui Pan; Zhiwu Lu; Yun-Hui Liu; Li Erran Li; Mingyu Ding; Chi-Wing Fu

arXiv:2505.12744 · cs.AI · May 20, 2025

TL;DR

This paper introduces ReasonManip, a large multimodal model that leverages advanced reasoning to improve robotic manipulation, demonstrating high generalizability, transferability, and interpretability through a novel task formulation and reinforcement learning.

Contribution

The paper presents a new approach that enables large multimodal models to directly infer robotic actions via reasoning, using a novel spatial representation and fine-tuning with dialogue-based datasets.

Findings

01 ReasonManip shows strong generalization to new environments and objects.

02 The model achieves effective sim-to-real transfer.

03 It provides transparent reasoning linking high-level decisions to low-level control.

Abstract

Recent Large Multimodal Models (LMMs) have demonstrated remarkable reasoning capabilities, especially in solving complex mathematical problems and realizing accurate spatial perception. Our key insight is that these emerging abilities can naturally extend to robotic manipulation by enabling LMMs to directly infer the next goal in language via reasoning, rather than relying on a separate action head. However, this paradigm meets two main challenges: i) How to make LMMs understand the spatial action space, and ii) How to fully exploit the reasoning capacity of LMMs in solving these tasks. To tackle the former challenge, we propose a novel task formulation, which inputs the current states of object parts and the gripper, and reformulates rotation by a new axis representation instead of traditional Euler angles. This representation is more compatible with spatial reasoning and easier to interpret…
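
The abstract excerpt does not spell out the axis representation, so the following is only an illustrative sketch of one plausible reading: describe an orientation by the unit direction vectors of the gripper's local axes in the world frame, rather than by three Euler angles. The function names, the intrinsic Z-Y-X (yaw, pitch, roll) convention, and the choice of which axes to report are all assumptions made for this example, not details taken from the paper.

```python
import math

def euler_to_axes(roll, pitch, yaw):
    """Return the x, y, z unit axes of a frame rotated by R = Rz(yaw) Ry(pitch) Rx(roll).

    Angles are in radians. Each returned tuple is one column of R, i.e. the
    direction of one local axis expressed in world coordinates.
    (Hypothetical helper; the convention is an assumption for illustration.)
    """
    cr, sr = math.cos(roll), math.sin(roll)
    cp, sp = math.cos(pitch), math.sin(pitch)
    cy, sy = math.cos(yaw), math.sin(yaw)
    x_axis = (cy * cp, sy * cp, -sp)
    y_axis = (cy * sp * sr - sy * cr, sy * sp * sr + cy * cr, cp * sr)
    z_axis = (cy * sp * cr + sy * sr, sy * sp * cr - cy * sr, cp * cr)
    return x_axis, y_axis, z_axis

def cross(a, b):
    """Cross product of two 3-vectors given as tuples."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

if __name__ == "__main__":
    # A gripper yawed by 90 degrees: its x axis now points along world y.
    x_axis, y_axis, z_axis = euler_to_axes(0.0, 0.0, math.pi / 2)
    print("x axis:", [round(v, 3) for v in x_axis])
    print("y axis:", [round(v, 3) for v in y_axis])
    # Two axes are enough: the third follows from the right-hand rule.
    print("z axis from x cross y:", [round(v, 3) for v in cross(x_axis, y_axis)])
```

Under this reading, each number is a component of a direction in the scene (for example, "the gripper's x axis points along world y"), which is arguably easier to state and check in language than a triple of angles whose meaning depends on the rotation order. Because the axes are orthonormal, two of them determine the third.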

Taxonomy

Topics: Multimodal Machine Learning Applications · Robot Manipulation and Learning · Reinforcement Learning in Robotics