GraspMAS: Zero-Shot Language-driven Grasp Detection with Multi-Agent System

Quang Nguyen; Tri Le; Huy Nguyen; Thieu Vo; Tung D. Ta; Baoru Huang; Minh N. Vu; Anh Nguyen

arXiv:2506.18448 · cs.RO · July 22, 2025

TL;DR

GraspMAS introduces a multi-agent system that leverages natural language to improve robotic grasp detection, especially in complex and cluttered environments, without requiring domain-specific training or fine-tuning.

Contribution

This paper presents a novel multi-agent framework for language-driven grasp detection that enhances reasoning and decision-making in real-world scenarios, outperforming existing methods.

Findings

01 Significant performance improvements over baseline methods.

02 Effective in both simulation and real-world robot experiments.

03 Handles complex language instructions and cluttered environments.

Abstract

Language-driven grasp detection has the potential to revolutionize human-robot interaction by allowing robots to understand and execute grasping tasks based on natural language commands. However, existing approaches face two key challenges. First, they often struggle to interpret complex text instructions or operate ineffectively in densely cluttered environments. Second, most methods require a training or fine-tuning step to adapt to new domains, limiting their generalization in real-world applications. In this paper, we introduce GraspMAS, a new multi-agent system framework for language-driven grasp detection. GraspMAS is designed to reason through ambiguities and improve decision-making in real-world scenarios. Our framework consists of three specialized agents: Planner, responsible for strategizing complex queries; Coder, which generates and executes source code; and Observer, which…
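To make the three-agent structure named in the abstract more concrete, here is a minimal Python sketch of how a planner, a coder, and an observer could be chained. It is not the authors' implementation. Every name in it (`run_agents`, `Attempt`, the `toy_*` functions) is hypothetical, and because the abstract is cut off before it describes the Observer, the sketch simply assumes the Observer inspects the Coder's output and either accepts it or returns a feedback string that is passed back to the Planner. No language model, vision model, or robot is involved; the agents are plain callables.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Attempt:
    plan: list[str]          # steps produced by the planner stand-in
    output: str              # what the coder stand-in produced
    feedback: Optional[str]  # None means the observer stand-in accepted it


def run_agents(
    query: str,
    planner: Callable[[str, Optional[str]], list[str]],
    coder: Callable[[list[str]], str],
    observer: Callable[[str, str], Optional[str]],
    max_rounds: int = 3,
) -> list[Attempt]:
    """Run planner -> coder -> observer until accepted or rounds run out.

    The feedback loop is an assumption of this sketch, not a detail
    taken from the abstract.
    """
    history: list[Attempt] = []
    feedback: Optional[str] = None
    for _ in range(max_rounds):
        plan = planner(query, feedback)
        output = coder(plan)
        feedback = observer(query, output)
        history.append(Attempt(plan, output, feedback))
        if feedback is None:
            break
    return history


# Toy stand-ins (hypothetical) so the sketch runs on its own.
def toy_planner(query: str, feedback: Optional[str]) -> list[str]:
    steps = ["locate: " + query]
    if feedback is not None:
        steps.append("retry with hint: " + feedback)
    return steps


def toy_coder(plan: list[str]) -> str:
    # The paper's Coder generates and executes source code;
    # this stand-in only joins the plan steps into one string.
    return " | ".join(plan)


def toy_observer(query: str, output: str) -> Optional[str]:
    # Accept only outputs that already include a retry step.
    return None if "retry" in output else "be more specific"


history = run_agents("grasp the mug handle", toy_planner, toy_coder, toy_observer)
```

With these toy stand-ins the first round is rejected and the second is accepted, so `history` holds two attempts. Passing the agents in as callables keeps the loop independent of whichever models would back each role in practice.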
