SparseGrasp: Robotic Grasping via 3D Semantic Gaussian Splatting from Sparse Multi-View RGB Images

Junqiu Yu; Xinlin Ren; Yongchong Gu; Haitao Lin; Tianyu Wang; Yi Zhu; Hang Xu; Yu-Gang Jiang; Xiangyang Xue; Yanwei Fu

arXiv:2412.02140·cs.RO·December 4, 2024

TL;DR

SparseGrasp is a new robotic grasping system that efficiently uses sparse multi-view RGB images, incorporates semantic understanding, and rapidly updates scenes for better performance in dynamic environments.

Contribution

It introduces a novel approach combining 3D Gaussian Splatting, semantic awareness, and PCA-based feature compression for fast, scene-adaptive robotic grasping from sparse views.
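The PCA-based feature compression mentioned above can be illustrated with a minimal sketch. It assumes per-point semantic features are stored as an (N, D) array and uses synthetic data; the function names (`fit_pca`, `compress`, `decompress`) and all dimensions are hypothetical and are not taken from the paper, whose actual pipeline may differ.

```python
import numpy as np


def fit_pca(features, n_components):
    """Fit PCA on an (N, D) feature matrix; return (mean, components)."""
    mean = features.mean(axis=0)
    centered = features - mean
    # Rows of vt are the principal directions, sorted by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return mean, vt[:n_components]


def compress(features, mean, components):
    """Project (N, D) features onto the retained principal directions."""
    return (features - mean) @ components.T


def decompress(codes, mean, components):
    """Approximately reconstruct (N, D) features from (N, k) codes."""
    return codes @ components + mean


# Hypothetical stand-in data: 1000 features of dimension 64 that lie
# close to an 8-dimensional subspace, plus a small amount of noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(1000, 8))
basis = rng.normal(size=(8, 64))
features = latent @ basis + 0.01 * rng.normal(size=(1000, 64))

mean, components = fit_pca(features, n_components=8)
codes = compress(features, mean, components)
recon = decompress(codes, mean, components)
rel_err = np.linalg.norm(features - recon) / np.linalg.norm(features)
```

In this sketch each point stores 8 values instead of 64, and the full features are recovered approximately when needed. How much real semantic features can be compressed depends on how much of their variance a few principal directions capture.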

Findings

01 Outperforms state-of-the-art methods in speed and adaptability

02 Enables multi-turn grasping in changeable environments

03 Maintains high fidelity with sparse supervision

Abstract

Language-guided robotic grasping is a rapidly advancing field where robots are instructed using human language to grasp specific objects. However, existing methods often depend on dense camera views and struggle to quickly update scenes, limiting their effectiveness in changeable environments. In contrast, we propose SparseGrasp, a novel open-vocabulary robotic grasping system that operates efficiently with sparse-view RGB images and handles scene updates quickly. Our system builds upon and significantly enhances existing computer vision modules in robotic learning. Specifically, SparseGrasp utilizes DUSt3R to generate a dense point cloud as the initialization for 3D Gaussian Splatting (3DGS), maintaining high fidelity even under sparse supervision. Importantly, SparseGrasp incorporates semantic awareness from recent vision foundation models. To further improve processing efficiency,…

Taxonomy

Topics: Advanced Vision and Imaging · Hand Gesture Recognition Systems · Industrial Vision Systems and Defect Detection

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings