Towards Open-World Grasping with Large Vision-Language Models

Georgios Tziafas; Hamidreza Kasaei

arXiv:2406.18722·cs.RO·October 15, 2024·1 cites

Open Access · 3 Reviews

TL;DR

This paper introduces OWG, a novel open-world robotic grasping system that leverages vision-language models for zero-shot, grounded reasoning about semantics and geometry, enabling robust grasping in complex, real-world scenarios.

Contribution

The work demonstrates that modern vision-language models can be effectively combined with segmentation and grasp synthesis for zero-shot, open-world grasping without external visual grounding or low-level spatial training.

Findings

01 OWG achieves robust zero-shot grasping in cluttered scenes.

02 The system outperforms previous supervised and zero-shot methods.

03 Extensive tests in simulation and hardware validate its effectiveness.

Abstract

The ability to grasp objects in the wild from open-ended language instructions constitutes a fundamental challenge in robotics. An open-world grasping system should be able to combine high-level contextual reasoning with low-level physical-geometric reasoning in order to be applicable in arbitrary scenarios. Recent works exploit the web-scale knowledge inherent in large language models (LLMs) to plan and reason in robotic contexts, but rely on external vision and action models to ground such knowledge in the environment and parameterize actuation. This setup suffers from two major bottlenecks: (a) the LLM's reasoning capacity is constrained by the quality of visual grounding, and (b) LLMs lack low-level spatial understanding of the world, which is essential for grasping in contact-rich scenarios. In this work we demonstrate that modern vision-language models (VLMs) are capable of…

Peer Reviews

Decision·CoRL 2024

Reviewer 01 · Rating 3 · Confidence 4

Reviewer 02 · Rating 4 · Confidence 4

Reviewer 03 · Rating 2 · Confidence 4

Taxonomy

Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Topic Modeling