TL;DR
DexGraspVLA is a hierarchical vision-language-action framework for general dexterous grasping. It pairs a pre-trained vision-language model for high-level planning with a diffusion-based low-level controller, and reports a success rate above 90% in unseen cluttered scenes.
Contribution
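The paper presents a hierarchical framework that combines a pre-trained vision-language model (as the planner) with a diffusion-based action controller, aiming for robust, generalizable dexterous grasping without the restrictive assumptions of prior work, such as single-object settings or limited environments.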
Findings
Achieves a success rate above 90% in unseen cluttered scenes.
Demonstrates robustness to adversarial objects and disturbances.
Enables free-form long-horizon prompt execution.
Abstract
Dexterous grasping remains a fundamental yet challenging problem in robotics. A general-purpose robot must be capable of grasping diverse objects in arbitrary scenarios. However, existing research typically relies on restrictive assumptions, such as single-object settings or limited environments, showing constrained generalization. We present DexGraspVLA, a hierarchical framework for robust generalization in language-guided general dexterous grasping and beyond. It utilizes a pre-trained Vision-Language model as the high-level planner and learns a diffusion-based low-level Action controller. The key insight to achieve generalization lies in iteratively transforming diverse language and visual inputs into domain-invariant representations via foundation models, where imitation learning can be effectively applied due to the alleviation of domain shift. Notably, our method achieves a 90+%…
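The two-level structure described in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not the authors' implementation: `plan_target` is a hypothetical keyword-matching stub standing in for the pre-trained vision-language planner, the "domain-invariant representation" is assumed to be a 2D target position, and `denoise_action` is a toy iterative refinement from Gaussian noise standing in for a learned diffusion denoiser.

```python
import random
from typing import Dict, Tuple

Vec = Tuple[float, float]


def plan_target(prompt: str, scene: Dict[str, Vec]) -> Vec:
    """Hypothetical planner stub. A real system would query a pre-trained
    vision-language model; here we return the position of the first scene
    object whose name appears in the prompt."""
    text = prompt.lower()
    for name, position in scene.items():
        if name in text:
            return position
    raise ValueError("prompt does not mention any object in the scene")


def denoise_action(target: Vec, hand: Vec, rng: random.Random, steps: int = 10) -> Vec:
    """Toy stand-in for a diffusion-based controller (assumption, not a
    trained model): start from Gaussian noise and refine it step by step
    toward the clean action, which here is simply the offset to the target."""
    clean = (target[0] - hand[0], target[1] - hand[1])
    action = (rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0))
    for k in range(steps, 0, -1):
        # Move a fraction 1/k of the remaining way toward the clean action;
        # at k == 1 the remaining noise is removed.
        action = tuple(a + (c - a) / k for a, c in zip(action, clean))
    return action


def grasp(prompt: str, scene: Dict[str, Vec], hand: Vec = (0.0, 0.0),
          control_steps: int = 20, gain: float = 0.5, seed: int = 0) -> Vec:
    """Hierarchical loop: plan once at the high level, then run the
    low-level controller in closed loop on the planner's output."""
    rng = random.Random(seed)
    target = plan_target(prompt, scene)
    for _ in range(control_steps):
        dx, dy = denoise_action(target, hand, rng)
        hand = (hand[0] + gain * dx, hand[1] + gain * dy)
    return hand


if __name__ == "__main__":
    scene = {"cup": (0.4, 0.1), "ball": (0.2, -0.3)}
    print(grasp("Pick up the cup", scene))
```

The point of the split is that the controller only ever sees the planner's output (here, a target position), never the raw prompt or scene, which mirrors the abstract's argument that learning on domain-invariant representations alleviates domain shift.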