CLASP: Closed-loop Asynchronous Spatial Perception for Open-vocabulary Desktop Object Grasping
Yiran Ling, Wenxuan Li, Siying Dong, Yize Zhang, Xiaoyao Huang, Jing Jiang, Ruonan Li, and Jie Liu

TL;DR
CLASP is a novel closed-loop framework that enhances robotic desktop object grasping by integrating multimodal perception, logical reasoning, and feedback to improve success rates and robustness in dynamic environments.
Contribution
The paper introduces CLASP, a new asynchronous closed-loop system with hierarchical perception and error correction, advancing open-vocabulary robotic grasping in complex settings.
Findings
Achieves an 87.0% success rate in grasping tasks.
Demonstrates strong generalization across diverse objects.
Bridges the sim-to-real gap effectively.
Abstract
Robotic grasping of desktop objects is widely used in intelligent manufacturing, logistics, and agriculture. Although vision-language models (VLMs) show strong potential for robotic manipulation, their deployment in low-level grasping faces key challenges: scarce high-quality multimodal demonstrations, spatial hallucination caused by weak geometric grounding, and the fragility of open-loop execution in dynamic environments. To address these challenges, we propose Closed-Loop Asynchronous Spatial Perception (CLASP), a novel asynchronous closed-loop framework that integrates multimodal perception, logical reasoning, and state-reflective feedback. First, we design a Dual-Pathway Hierarchical Perception module that decouples high-level semantic intent from geometric grounding. This design guides the inference model's output toward definite action tuples, reducing spatial hallucination. Second,…
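The contrast the abstract draws between open-loop execution and closed-loop feedback can be illustrated with a minimal sketch. This is not the CLASP implementation: the names `GraspAction` and `closed_loop_grasp`, the three callbacks, and the retry limit are all hypothetical, and the loop is synchronous for brevity, whereas the paper describes an asynchronous framework. It only shows the general pattern of proposing an action, executing it, checking the resulting state, and feeding the outcome back into the next proposal.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class GraspAction:
    """Hypothetical action tuple: a target label plus a grasp point."""
    label: str                             # open-vocabulary target, e.g. "red mug"
    position: Tuple[float, float, float]   # assumed grasp point (x, y, z) in metres


# Hypothetical callback types (assumptions, not from the paper):
#   propose(feedback) -> GraspAction      uses the previous feedback, if any
#   execute(action)   -> None             sends the action to the robot
#   verify(action)    -> (ok, feedback)   inspects the state after execution
Propose = Callable[[Optional[str]], GraspAction]
Execute = Callable[[GraspAction], None]
Verify = Callable[[GraspAction], Tuple[bool, Optional[str]]]


def closed_loop_grasp(
    propose: Propose,
    execute: Execute,
    verify: Verify,
    max_attempts: int = 3,
) -> Optional[int]:
    """Retry a grasp until verification succeeds.

    Returns the 1-based attempt number that succeeded, or None if every
    attempt failed. An open-loop system would call propose and execute
    once and never look at the result.
    """
    feedback: Optional[str] = None
    for attempt in range(1, max_attempts + 1):
        action = propose(feedback)           # plan, conditioned on the last failure
        execute(action)                      # act
        success, feedback = verify(action)   # observe the outcome
        if success:
            return attempt
    return None
```

In this sketch the feedback is a plain string; a real system would more likely pass structured state (images, poses, gripper readings) back to the perception and reasoning stages.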
