GeoLanG: Geometry-Aware Language-Guided Grasping with Unified RGB-D Multimodal Learning

Rui Tang; Guankun Wang; Long Bai; Huxin Gao; Jiewen Lai; Chi Kit Ng; Jiazheng Wang; Fan Zhang; Hongliang Ren

arXiv:2602.04231 · cs.RO · February 5, 2026

TL;DR

GeoLanG is an end-to-end framework that unifies visual and linguistic inputs using CLIP, enhanced by depth-guided geometric priors and adaptive feature integration, to improve language-guided robotic grasping in cluttered and occluded scenes.

Contribution

The paper introduces GeoLanG, a novel geometry-aware, multimodal learning framework that effectively combines RGB-D data and language instructions for robust robotic grasping.

Findings

01 Achieves precise grasping in cluttered environments

02 Demonstrates robustness in occluded and low-texture scenes

03 Outperforms existing methods on the OCID-VLG dataset

Abstract

Language-guided grasping has emerged as a promising paradigm for enabling robots to identify and manipulate target objects through natural language instructions, yet it remains highly challenging in cluttered or occluded scenes. Existing methods often rely on multi-stage pipelines that separate object perception and grasping, which leads to limited cross-modal fusion, redundant computation, and poor generalization in cluttered, occluded, or low-texture scenes. To address these limitations, we propose GeoLanG, an end-to-end multi-task framework built upon the CLIP architecture that unifies visual and linguistic inputs into a shared representation space for robust semantic alignment and improved generalization. To enhance target discrimination under occlusion and low-texture conditions, we explore a more effective use of depth information through the Depth-guided Geometric Module (DGGM),…
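As a rough illustration of what a "shared representation space" buys in practice, the sketch below scores every pixel embedding against a single instruction embedding by cosine similarity and picks the best-matching location. It is a minimal sketch under stated assumptions, not the GeoLanG architecture: the function names (`l2_normalize`, `language_similarity_map`), the feature shapes, and the random toy features are all hypothetical, and a real system would obtain the embeddings from CLIP-style image and text encoders rather than a random generator.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to unit length along `axis` (eps guards against zero norm)."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def language_similarity_map(pixel_feats, text_feat):
    """Cosine similarity between each pixel embedding and one text embedding.

    pixel_feats: (H, W, D) array of per-pixel features (hypothetical shape).
    text_feat:   (D,) array for the instruction (hypothetical shape).
    Returns an (H, W) map with values in [-1, 1].
    """
    p = l2_normalize(pixel_feats)
    t = l2_normalize(text_feat)
    return p @ t  # dot product over the last axis gives an (H, W) map


# Toy stand-ins for encoder outputs (assumption: both live in the same D-dim space).
rng = np.random.default_rng(0)
H, W, D = 4, 5, 8
pixel_feats = rng.normal(size=(H, W, D))
text_feat = rng.normal(size=D)

# Plant one pixel whose feature is perfectly aligned with the instruction.
pixel_feats[2, 3] = 3.0 * text_feat

sim = language_similarity_map(pixel_feats, text_feat)
best = np.unravel_index(np.argmax(sim), sim.shape)
print("best-matching pixel (row, col):", best)
```

Because both modalities are normalized into the same space, a single dot product serves as the alignment score; the depth-guided and adaptive-fusion components described in the abstract are not modeled here.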

Taxonomy

Topics: Robot Manipulation and Learning · Multimodal Machine Learning Applications · Hand Gesture Recognition Systems