SparseGrasp: Robotic Grasping via 3D Semantic Gaussian Splatting from Sparse Multi-View RGB Images
Junqiu Yu, Xinlin Ren, Yongchong Gu, Haitao Lin, Tianyu Wang, Yi Zhu,, Hang Xu, Yu-Gang Jiang, Xiangyang Xue, Yanwei Fu

TL;DR
SparseGrasp is a new robotic grasping system that efficiently uses sparse multi-view RGB images, incorporates semantic understanding, and rapidly updates scenes for better performance in dynamic environments.
Contribution
It introduces a novel approach combining 3D Gaussian Splatting, semantic awareness, and PCA-based feature compression for fast, scene-adaptive robotic grasping from sparse views.
Findings
Outperforms state-of-the-art in speed and adaptability
Enables multi-turn grasping in changeable environments
Maintains high fidelity with sparse supervision
Abstract
Language-guided robotic grasping is a rapidly advancing field where robots are instructed using human language to grasp specific objects. However, existing methods often depend on dense camera views and struggle to quickly update scenes, limiting their effectiveness in changeable environments. In contrast, we propose SparseGrasp, a novel open-vocabulary robotic grasping system that operates efficiently with sparse-view RGB images and handles scene updates fastly. Our system builds upon and significantly enhances existing computer vision modules in robotic learning. Specifically, SparseGrasp utilizes DUSt3R to generate a dense point cloud as the initialization for 3D Gaussian Splatting (3DGS), maintaining high fidelity even under sparse supervision. Importantly, SparseGrasp incorporates semantic awareness from recent vision foundation models. To further improve processing efficiency,…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsAdvanced Vision and Imaging · Hand Gesture Recognition Systems · Industrial Vision Systems and Defect Detection
MethodsSPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings
