VLAD-Grasp: Zero-shot Grasp Detection via Vision-Language Models
Manav Kulshrestha, S. Talha Bukhari, Damon Conover, Aniket Bera

TL;DR
VLAD-Grasp leverages vision-language models for zero-shot robotic grasp detection, eliminating the need for curated datasets and enabling generalization to real-world objects with competitive performance.
Contribution
The paper introduces a training-free, zero-shot grasp detection method using vision-language models, advancing robotic manipulation without dataset curation.
Findings
Achieves competitive accuracy on the Cornell and Jacquard grasp datasets.
Demonstrates successful zero-shot grasping on real-world objects.
Eliminates the need for curated grasp datasets.
Abstract
Robotic grasping is a fundamental capability for autonomous manipulation, and a given object usually admits infinitely many valid grasps. State-of-the-art grasping approaches rely on learning from large-scale datasets comprising expert annotations of feasible grasps. Curating such datasets is challenging; as a result, learning-based methods are limited by the solution coverage of the dataset and require retraining to handle novel objects. To address this, we present VLAD-Grasp, a Vision-Language model Assisted zero-shot approach for Detecting Grasps. Our method (1) prompts a large vision-language model to generate a goal image where a virtual cylindrical proxy intersects the object's geometry, explicitly encoding an antipodal grasp axis in image space, then (2) predicts depth and segmentation to lift this generated image into 3D, and (3) aligns generated and observed object point clouds via principal…
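To make step (2) of the abstract concrete, the following is a minimal sketch of how a predicted depth map and segmentation mask can be lifted into a 3D point cloud using a standard pinhole camera model. It is an illustration under stated assumptions, not the authors' implementation: the function name `backproject_masked_depth` and the intrinsics `fx`, `fy`, `cx`, `cy` are hypothetical, and the abstract does not say which depth or segmentation models are used or how camera parameters are obtained.

```python
from typing import List, Sequence, Tuple

Point3D = Tuple[float, float, float]


def backproject_masked_depth(
    depth: Sequence[Sequence[float]],
    mask: Sequence[Sequence[bool]],
    fx: float,
    fy: float,
    cx: float,
    cy: float,
) -> List[Point3D]:
    """Lift masked depth pixels to camera-frame 3D points (pinhole model).

    depth[v][u] is the depth at image row v, column u (same units as output).
    mask[v][u] is True where the pixel belongs to the segmented object.
    fx, fy, cx, cy are assumed camera intrinsics (hypothetical values here).
    """
    points: List[Point3D] = []
    for v, (depth_row, mask_row) in enumerate(zip(depth, mask)):
        for u, (z, keep) in enumerate(zip(depth_row, mask_row)):
            # Skip background pixels and invalid (non-positive) depths.
            if not keep or z <= 0.0:
                continue
            # Invert the pinhole projection u = fx * x / z + cx (same for v).
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            points.append((x, y, z))
    return points


# Toy 2x2 example with made-up intrinsics.
toy_depth = [[1.0, 2.0], [0.0, 4.0]]
toy_mask = [[True, False], [True, True]]
toy_points = backproject_masked_depth(toy_depth, toy_mask, fx=2.0, fy=2.0, cx=0.0, cy=0.0)
```

In the toy example, one pixel is dropped because it lies outside the mask and another because its depth is zero, leaving two 3D points. Running the same routine on the generated image and on the observed image would give the two object point clouds that step (3) aligns.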
Taxonomy
Topics: Robot Manipulation and Learning · Motor Control and Adaptation · Reinforcement Learning in Robotics
