UniDiffGrasp: A Unified Framework Integrating VLM Reasoning and VLM-Guided Part Diffusion for Open-Vocabulary Constrained Grasping with Dual Arms
Xueyang Guo, Hongwei Hu, Chengye Song, Jiale Chen, Zilin Zhao, Yu Fu, Bowen Guan, Zhenze Liu

TL;DR
UniDiffGrasp is a novel framework that combines vision-language reasoning with part-guided diffusion to enable precise, open-vocabulary, dual-arm grasping of functional parts in real-world environments.
Contribution
It introduces a unified approach integrating VLM reasoning with guided part diffusion for open-vocabulary, dual-arm grasping without retraining, improving accuracy and coordination.
Findings
Achieves 87.6% success in single-arm grasping
Achieves 76.7% success in dual-arm grasping
Outperforms existing state-of-the-art methods in real-world tests
Abstract
Open-vocabulary, task-oriented grasping of specific functional parts, particularly with dual arms, remains a key challenge, as current Vision-Language Models (VLMs), while enhancing task understanding, often struggle with precise grasp generation within defined constraints and effective dual-arm coordination. We innovatively propose UniDiffGrasp, a unified framework integrating VLM reasoning with guided part diffusion to address these limitations. UniDiffGrasp leverages a VLM to interpret user input and identify semantic targets (object, part(s), mode), which are then grounded via open-vocabulary segmentation. Critically, the identified parts directly provide geometric constraints for a Constrained Grasp Diffusion Field (CGDF) using its Part-Guided Diffusion, enabling efficient, high-quality 6-DoF grasps without retraining. For dual-arm tasks, UniDiffGrasp defines distinct target…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsRobot Manipulation and Learning · Motor Control and Adaptation · Multimodal Machine Learning Applications
