DexGraspVLA: A Vision-Language-Action Framework Towards General Dexterous Grasping

Yifan Zhong; Xuchuan Huang; Ruochong Li; Ceyao Zhang; Zhang Chen; Tianrui Guan; Fanlian Zeng; Ka Num Lui; Yuyao Ye; Yitao Liang; Yaodong Yang; and Yuanpei Chen

arXiv:2502.20900 · cs.RO · November 18, 2025

TL;DR

DexGraspVLA introduces a hierarchical vision-language-action framework that significantly improves general dexterous grasping in complex, unseen environments by leveraging foundation models and diffusion-based control.

Contribution

It presents a novel hierarchical framework combining pre-trained vision-language models and diffusion-based controllers for robust, generalizable dexterous grasping beyond prior restrictive assumptions.

Findings

01. Achieves over 90% success rate in unseen cluttered scenes.

02. Demonstrates robustness to adversarial objects and disturbances.

03. Enables free-form long-horizon prompt execution.

Abstract

Dexterous grasping remains a fundamental yet challenging problem in robotics. A general-purpose robot must be capable of grasping diverse objects in arbitrary scenarios. However, existing research typically relies on restrictive assumptions, such as single-object settings or limited environments, showing constrained generalization. We present DexGraspVLA, a hierarchical framework for robust generalization in language-guided general dexterous grasping and beyond. It utilizes a pre-trained Vision-Language model as the high-level planner and learns a diffusion-based low-level Action controller. The key insight to achieve generalization lies in iteratively transforming diverse language and visual inputs into domain-invariant representations via foundation models, where imitation learning can be effectively applied due to the alleviation of domain shift. Notably, our method achieves a 90+%…
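The planner/controller split described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and not the authors' implementation: the names `Scene`, `high_level_plan`, `low_level_step`, and `run_episode` are hypothetical, the bounding box stands in for a domain-invariant target representation, the planner stub only matches object names in the prompt (a real vision-language model would reason over images and free-form text), and the controller stub is a plain proportional step used in place of a learned diffusion-based policy.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Hypothetical box format: (x_min, y_min, x_max, y_max) in pixels.
BBox = Tuple[int, int, int, int]
Point = Tuple[float, float]


@dataclass
class Scene:
    """Toy stand-in for a camera observation: object name -> bounding box."""
    objects: Dict[str, BBox]


def high_level_plan(prompt: str, scene: Scene) -> BBox:
    """Placeholder for the pre-trained vision-language planner (assumption).

    Returns the box of the first object whose name appears in the prompt.
    """
    for name, box in scene.objects.items():
        if name in prompt.lower():
            return box
    raise ValueError("prompt does not mention any object in the scene")


def box_center(box: BBox) -> Point:
    x0, y0, x1, y1 = box
    return ((x0 + x1) / 2.0, (y0 + y1) / 2.0)


def low_level_step(hand: Point, target: Point, gain: float = 0.5) -> Point:
    """Placeholder for the low-level controller (assumption).

    A proportional step toward the target, NOT a diffusion policy.
    """
    return (gain * (target[0] - hand[0]), gain * (target[1] - hand[1]))


def run_episode(prompt: str, scene: Scene, hand: Point = (0.0, 0.0),
                max_steps: int = 50, tol: float = 1.0) -> List[Point]:
    """Plan once at the high level, then run the low-level loop to the target."""
    target = box_center(high_level_plan(prompt, scene))
    trajectory = [hand]
    for _ in range(max_steps):
        dx, dy = low_level_step(hand, target)
        hand = (hand[0] + dx, hand[1] + dy)
        trajectory.append(hand)
        if abs(hand[0] - target[0]) <= tol and abs(hand[1] - target[1]) <= tol:
            break
    return trajectory


if __name__ == "__main__":
    scene = Scene({"cup": (100, 50, 140, 90), "ball": (10, 10, 30, 30)})
    path = run_episode("Grasp the cup", scene)
    print(f"reached {path[-1]} in {len(path) - 1} steps")
```

The point of the split is that the low-level loop only ever sees the planner's compact target, never the raw prompt, so the same controller can in principle be reused across differently worded instructions.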
