Towards Open-World Grasping with Large Vision-Language Models
Georgios Tziafas, Hamidreza Kasaei

TL;DR
This paper introduces OWG, a novel open-world robotic grasping system that leverages vision-language models for zero-shot, grounded reasoning about semantics and geometry, enabling robust grasping in complex, real-world scenarios.
Contribution
The work demonstrates that modern vision-language models can be effectively combined with segmentation and grasp synthesis for zero-shot, open-world grasping without external visual grounding or low-level spatial training.
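The combination described above (segmentation, VLM-based grounding, grasp synthesis) can be pictured as a three-stage pipeline. The following is a minimal sketch under stated assumptions, not the authors' implementation: the names `Segment`, `Grasp`, `grasp_from_instruction`, and the three `toy_*` functions are hypothetical, and the toy grounding step uses plain substring matching as a stand-in for a real vision-language model.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass(frozen=True)
class Segment:
    """One segmented region, identified by a numeric label (hypothetical type)."""
    label: int
    name: str  # toy stand-in for the pixels a real mask would hold


@dataclass(frozen=True)
class Grasp:
    """A candidate grasp on a segment with a quality score (hypothetical type)."""
    segment_label: int
    score: float


def grasp_from_instruction(
    image: object,
    instruction: str,
    segment: Callable[[object], List[Segment]],
    ground: Callable[[object, Sequence[Segment], str], int],
    synthesize: Callable[[object, Segment], List[Grasp]],
) -> Grasp:
    """Run segmentation, grounding, and grasp synthesis, then pick the best grasp."""
    segments = segment(image)
    if not segments:
        raise ValueError("segmentation returned no regions")

    # Grounding stage: in a real system a VLM would choose the target label.
    target_label = ground(image, segments, instruction)
    by_label = {s.label: s for s in segments}
    if target_label not in by_label:
        raise ValueError(f"grounding returned unknown label {target_label}")

    candidates = synthesize(image, by_label[target_label])
    if not candidates:
        raise ValueError("no grasp candidates for the target segment")

    # Ranking stage: simplified here to taking the highest-scoring candidate.
    return max(candidates, key=lambda g: g.score)


# Toy stand-ins so the sketch runs without any model or camera.
def toy_segment(image: object) -> List[Segment]:
    return [Segment(i, name) for i, name in enumerate(image)]


def toy_ground(image: object, segments: Sequence[Segment], instruction: str) -> int:
    for s in segments:
        if s.name in instruction:  # substring match, NOT a real VLM query
            return s.label
    return -1


def toy_synthesize(image: object, seg: Segment) -> List[Grasp]:
    return [Grasp(seg.label, 0.2), Grasp(seg.label, 0.9)]


best = grasp_from_instruction(
    ["mug", "banana"], "pick up the banana",
    toy_segment, toy_ground, toy_synthesize,
)
print(best)
```

Passing the three stages as callables keeps the sketch independent of any particular segmentation model, VLM, or grasp network; swapping in real components would only require matching these assumed signatures.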
Findings
OWG achieves robust zero-shot grasping in cluttered scenes.
The system outperforms previous supervised and zero-shot methods.
Extensive tests in simulation and hardware validate its effectiveness.
Abstract
The ability to grasp objects in the wild from open-ended language instructions constitutes a fundamental challenge in robotics. An open-world grasping system should be able to combine high-level contextual reasoning with low-level physical-geometric reasoning in order to be applicable in arbitrary scenarios. Recent works exploit the web-scale knowledge inherent in large language models (LLMs) to plan and reason in robotic contexts, but rely on external vision and action models to ground such knowledge into the environment and parameterize actuation. This setup suffers from two major bottlenecks: a) the LLM's reasoning capacity is constrained by the quality of visual grounding, and b) LLMs do not contain low-level spatial understanding of the world, which is essential for grasping in contact-rich scenarios. In this work we demonstrate that modern vision-language models (VLMs) are capable of…
Peer Reviews
Decision · CoRL 2024
Taxonomy
Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Topic Modeling
