UniDiffGrasp: A Unified Framework Integrating VLM Reasoning and VLM-Guided Part Diffusion for Open-Vocabulary Constrained Grasping with Dual Arms

Xueyang Guo; Hongwei Hu; Chengye Song; Jiale Chen; Zilin Zhao; Yu Fu; Bowen Guan; Zhenze Liu

arXiv:2505.06832·cs.RO·May 13, 2025

Open Access

TL;DR

UniDiffGrasp is a novel framework that combines vision-language reasoning with part-guided diffusion to enable precise, open-vocabulary, dual-arm grasping of functional parts in real-world environments.

Contribution

UniDiffGrasp introduces a unified approach that integrates VLM reasoning with part-guided diffusion for open-vocabulary, dual-arm grasping without retraining, improving grasp accuracy and dual-arm coordination.

Findings

01 Achieves 87.6% success in single-arm grasping
02 Achieves 76.7% success in dual-arm grasping
03 Outperforms existing state-of-the-art methods in real-world tests

Abstract

Open-vocabulary, task-oriented grasping of specific functional parts, particularly with dual arms, remains a key challenge. Current Vision-Language Models (VLMs) enhance task understanding but often struggle with precise grasp generation within defined constraints and with effective dual-arm coordination. We propose UniDiffGrasp, a unified framework that integrates VLM reasoning with guided part diffusion to address these limitations. UniDiffGrasp leverages a VLM to interpret user input and identify semantic targets (object, part(s), mode), which are then grounded via open-vocabulary segmentation. Critically, the identified parts directly provide geometric constraints for a Constrained Grasp Diffusion Field (CGDF) using its Part-Guided Diffusion, enabling efficient, high-quality 6-DoF grasps without retraining. For dual-arm tasks, UniDiffGrasp defines distinct target…
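The first two stages described in the abstract (the VLM naming an object, part(s), and mode, then the grounded part restricting where grasps may land) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the JSON reply format, the field names, the mode labels `single_arm` and `dual_arm`, and the helpers `parse_vlm_reply` and `part_constraint` are all assumptions made for this example. The CGDF diffusion step itself is not modeled.

```python
import json
from dataclasses import dataclass
from typing import List, Sequence, Tuple

# Hypothetical mode labels; the summary above does not state the paper's actual names.
VALID_MODES = {"single_arm", "dual_arm"}

Point = Tuple[float, float, float]


@dataclass
class GraspTarget:
    object_name: str   # e.g. "kettle"
    parts: List[str]   # e.g. ["handle"]
    mode: str          # one of VALID_MODES


def parse_vlm_reply(reply: str) -> GraspTarget:
    """Parse an assumed JSON reply from the VLM into a GraspTarget."""
    data = json.loads(reply)
    mode = data["mode"]
    if mode not in VALID_MODES:
        raise ValueError(f"unknown mode: {mode!r}")
    parts = list(data["parts"])
    if not parts:
        raise ValueError("at least one target part is required")
    return GraspTarget(object_name=data["object"], parts=parts, mode=mode)


def part_constraint(points: Sequence[Point], part_mask: Sequence[bool]) -> List[Point]:
    """Keep only the 3D points flagged by the segmentation mask as lying on the part."""
    if len(points) != len(part_mask):
        raise ValueError("points and part_mask must have the same length")
    return [p for p, keep in zip(points, part_mask) if keep]


# Usage: a reply in the assumed format, followed by a toy two-point cloud.
target = parse_vlm_reply('{"object": "kettle", "parts": ["handle"], "mode": "single_arm"}')
region = part_constraint([(0.0, 0.0, 0.0), (1.0, 1.0, 1.0)], [False, True])
```

In this sketch the filtered point set stands in for the geometric constraint that the identified part provides; a real system would pass such a region to the grasp generator rather than use it directly.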

Taxonomy

Topics: Robot Manipulation and Learning · Motor Control and Adaptation · Multimodal Machine Learning Applications