ManipLVM-R1: Reinforcement Learning for Reasoning in Embodied Manipulation with Large Vision-Language Models

Zirui Song; Guangxian Ouyang; Mingzhe Li; Yuheng Ji; Chenxi Wang; Zixiang Xu; Zeyu Zhang; Xiaoqing Zhang; Qian Jiang; Zhenhao Chen; Zhongzhi Li; Rui Yan; Xiuying Chen

arXiv:2505.16517·cs.RO·May 27, 2025

TL;DR

ManipLVM-R1 introduces a reinforcement learning framework with verifiable rewards to improve robotic manipulation using large vision-language models, reducing reliance on costly annotations and enhancing generalization and physical reasoning.

Contribution

The paper presents an RL-based approach whose rule-based rewards improve reasoning and generalization in robotic manipulation tasks without requiring extensive human annotations.

Findings

01 Enhanced localization of interaction regions.

02 Improved physical plausibility of action trajectories.

03 Better generalization to out-of-domain scenarios.

Abstract

Large Vision-Language Models (LVLMs) have recently advanced robotic manipulation by leveraging vision for scene perception and language for instruction following. However, existing methods rely heavily on costly human-annotated training datasets, which limits their generalization and causes them to struggle in out-of-domain (OOD) scenarios, reducing real-world adaptability. To address these challenges, we propose ManipLVM-R1, a novel reinforcement learning framework that replaces traditional supervision with Reinforcement Learning using Verifiable Rewards (RLVR). By directly optimizing for task-aligned outcomes, our method enhances generalization and physical reasoning while removing the dependence on costly annotations. Specifically, we design two rule-based reward functions targeting key robotic manipulation subtasks: an Affordance Perception Reward to enhance localization of…
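
The abstract text above is cut off before it gives the form of the reward functions. As a rough illustration of what a rule-based, verifiable reward for localization can look like, the sketch below scores a predicted box against a reference box with intersection-over-union (IoU). Using IoU here is an assumption made for illustration, not a formula taken from the text above, and the names `Box`, `box_iou`, and `affordance_reward` are hypothetical.

```python
from typing import Tuple

# Hypothetical box format: (x_min, y_min, x_max, y_max) in pixel coordinates.
Box = Tuple[float, float, float, float]


def box_area(box: Box) -> float:
    """Area of a box; degenerate or inverted boxes count as zero area."""
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])


def box_iou(pred: Box, ref: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes, in [0, 1]."""
    # Corners of the overlapping rectangle (may be empty).
    ix_min = max(pred[0], ref[0])
    iy_min = max(pred[1], ref[1])
    ix_max = min(pred[2], ref[2])
    iy_max = min(pred[3], ref[3])
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    union = box_area(pred) + box_area(ref) - inter
    return inter / union if union > 0 else 0.0


def affordance_reward(pred: Box, ref: Box) -> float:
    """Assumed rule-based reward: higher when the predicted region overlaps
    the reference region more. Computed by a fixed rule, so it can be
    verified without a learned reward model."""
    return box_iou(pred, ref)
```

A reward of this kind is computed directly from the model output and a reference region, which is what makes it checkable by rule rather than by human judgment.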

Videos

ManipLVM-R1: Reinforcement Learning for Reasoning in Embodied Manipulation with Large Vision-Language Models· underline

Taxonomy

Topics: Multimodal Machine Learning Applications · Robot Manipulation and Learning