MapleGrasp: Mask-guided Feature Pooling for Language-driven Efficient Robotic Grasping
Vineet Bhat, Naman Patel, Prashanth Krishnamurthy, Ramesh Karri, Farshad Khorrami

TL;DR
MapleGrasp introduces a mask-guided feature pooling framework for language-driven robotic grasping, improving efficiency and accuracy in unseen object manipulation through vision-language integration and a new large-scale dataset.
Contribution
The paper presents a novel mask-guided feature pooling method and a large open-source dataset, enhancing generalization and efficiency in language-driven robotic grasping tasks.
Findings
7% improvement over prior approaches on the OCID-VLG benchmark
89% grasping accuracy on RefGraspNet
73% success rate in real-world experiments with unseen objects
Abstract
Robotic manipulation of unseen objects via natural language commands remains challenging. Language-driven robotic grasping (LDRG) predicts stable grasp poses from natural language queries and RGB-D images. We propose MapleGrasp, a novel framework that leverages mask-guided feature pooling for efficient vision-language driven grasping. Our two-stage training first predicts segmentation masks from CLIP-based vision-language features. The second stage pools features within these masks to generate pixel-level grasp predictions, improving efficiency and reducing computation. Incorporating mask pooling results in a 7% improvement over prior approaches on the OCID-VLG benchmark. Furthermore, we introduce RefGraspNet, an open-source dataset eight times larger than existing alternatives, significantly enhancing model generalization for open-vocabulary grasping. MapleGrasp scores a strong…
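The two-stage idea in the abstract (predict a mask, then pool features inside it before predicting grasps) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `mask_guided_pool` and `grasp_scores`, the (C, H, W) feature layout, the 0.5 threshold, and the dot-product scoring head are all assumptions chosen for illustration, and random arrays stand in for CLIP-based features and stage-one mask logits.

```python
import numpy as np


def mask_guided_pool(features, mask_logits, threshold=0.5):
    """Restrict a feature map to a predicted object mask and pool inside it.

    features:    (C, H, W) array of fused vision-language features (assumed layout).
    mask_logits: (H, W) array of stage-one segmentation logits (assumed input).
    """
    probs = 1.0 / (1.0 + np.exp(-mask_logits))          # sigmoid over the logits
    mask = (probs > threshold).astype(features.dtype)   # binary (H, W) mask
    masked = features * mask[None, :, :]                # zero out background features
    area = max(float(mask.sum()), 1.0)                  # avoid dividing by zero on an empty mask
    pooled = masked.sum(axis=(1, 2)) / area             # (C,) mean feature over the mask
    return masked, pooled, mask


def grasp_scores(masked, pooled, mask):
    """Toy stand-in for a grasp head: score each pixel against the pooled descriptor."""
    scores = np.einsum("chw,c->hw", masked, pooled)
    return np.where(mask > 0, scores, -np.inf)          # background pixels can never win


# Hypothetical inputs: random features and a mask covering rows 2-4, columns 3-5.
rng = np.random.default_rng(0)
features = rng.normal(size=(4, 8, 8))
mask_logits = np.full((8, 8), -5.0)
mask_logits[2:5, 3:6] = 5.0

masked, pooled, mask = mask_guided_pool(features, mask_logits)
scores = grasp_scores(masked, pooled, mask)
row, col = np.unravel_index(np.argmax(scores), scores.shape)
print("best grasp pixel:", (int(row), int(col)))
```

Because every pixel outside the mask is scored as negative infinity, the selected grasp location always falls on the referred object, and any downstream head only needs to attend to the masked region. That is one plausible reading of why pooling within the mask could reduce computation.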
Taxonomy
Topics: Robot Manipulation and Learning · Motor Control and Adaptation · Multimodal Machine Learning Applications
