VL-Grasp: a 6-Dof Interactive Grasp Policy for Language-Oriented Objects in Cluttered Indoor Scenes
Yuhao Lu, Yixuan Fan, Beixing Deng, Fangfu Liu, Yali Li, Shengjin Wang

TL;DR
VL-Grasp is a novel 6-DOF interactive grasp policy enabling robots to locate and grasp target objects specified by human language in cluttered indoor scenes, demonstrating high success rates and robustness.
Contribution
The paper introduces a new visual grounding dataset, a 6-DOF interactive grasp policy, and a grasp pose filter, advancing language-based robotic grasping in complex environments.
Findings
Achieved 72.5% success rate in real-world indoor scenes.
Extended the universality of interactive grasping with 6-DOF policy.
Demonstrated effectiveness and extendibility of VL-Grasp.
Abstract
Robotic grasping faces new challenges in human-robot-interaction scenarios. We consider the task that the robot grasps a target object designated by human's language directives. The robot not only needs to locate a target based on vision-and-language information, but also needs to predict the reasonable grasp pose candidate at various views and postures. In this work, we propose a novel interactive grasp policy, named Visual-Lingual-Grasp (VL-Grasp), to grasp the target specified by human language. First, we build a new challenging visual grounding dataset to provide functional training data for robotic interactive perception in indoor environments. Second, we propose a 6-Dof interactive grasp policy combined with visual grounding and 6-Dof grasp pose detection to extend the universality of interactive grasping. Third, we design a grasp pose filter module to enhance the performance of…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsMultimodal Machine Learning Applications · Human Pose and Action Recognition · Robot Manipulation and Learning
