GeoLanG: Geometry-Aware Language-Guided Grasping with Unified RGB-D Multimodal Learning
Rui Tang, Guankun Wang, Long Bai, Huxin Gao, Jiewen Lai, Chi Kit Ng, Jiazheng Wang, Fan Zhang, Hongliang Ren

TL;DR
GeoLanG is an end-to-end framework that unifies visual and linguistic inputs using CLIP, enhanced by depth-guided geometric priors and adaptive feature integration, to improve language-guided robotic grasping in cluttered and occluded scenes.
Contribution
The paper introduces GeoLanG, a novel geometry-aware, multimodal learning framework that effectively combines RGB-D data and language instructions for robust robotic grasping.
Findings
Achieves precise grasping in cluttered environments
Demonstrates robustness in occluded and low-texture scenes
Outperforms existing methods on the OCID-VLG dataset
Abstract
Language-guided grasping has emerged as a promising paradigm for enabling robots to identify and manipulate target objects through natural language instructions, yet it remains highly challenging in cluttered or occluded scenes. Existing methods often rely on multi-stage pipelines that separate object perception and grasping, which leads to limited cross-modal fusion, redundant computation, and poor generalization in cluttered, occluded, or low-texture scenes. To address these limitations, we propose GeoLanG, an end-to-end multi-task framework built upon the CLIP architecture that unifies visual and linguistic inputs into a shared representation space for robust semantic alignment and improved generalization. To enhance target discrimination under occlusion and low-texture conditions, we explore a more effective use of depth information through the Depth-guided Geometric Module (DGGM),…
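To make the idea of a shared representation space concrete, the sketch below scores candidate image regions against a language instruction by cosine similarity, which is how CLIP-style embeddings are typically compared. It is a toy illustration only, not the GeoLanG architecture: the function names `cosine` and `best_region`, the three-dimensional vectors, and the region labels are all hypothetical, and a real system would obtain the embeddings from the CLIP image and text encoders.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def best_region(region_feats, text_feat):
    """Return the region whose embedding best matches the text embedding, plus all scores."""
    scores = {name: cosine(feat, text_feat) for name, feat in region_feats.items()}
    return max(scores, key=scores.get), scores

# Hypothetical embeddings standing in for encoder outputs (assumption, not real data).
regions = {
    "mug": [0.9, 0.1, 0.0],
    "box": [0.1, 0.8, 0.3],
}
instruction = [1.0, 0.0, 0.1]  # stand-in embedding for an instruction such as "grasp the mug"

target, scores = best_region(regions, instruction)
print(target, scores)
```

In this toy setup the region with the highest similarity to the instruction is treated as the grasp target. An end-to-end model would instead learn the fusion jointly with the grasp prediction rather than ranking precomputed regions.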
Taxonomy
Topics: Robot Manipulation and Learning · Multimodal Machine Learning Applications · Hand Gesture Recognition Systems
