UniHM: Unified Dexterous Hand Manipulation with Vision Language Model

Zhenhao Zhang; Jiaxin Liu; Ye Shi; Jingya Wang

arXiv:2603.00732 · cs.RO · March 3, 2026

Open Access 3 Reviews

TL;DR

UniHM is a novel framework enabling dexterous hand manipulation guided by open-vocabulary language commands, using a unified tokenizer, language-conditioned models, and physics-based refinement for realistic, generalizable robotic manipulation.

Contribution

It introduces a unified hand tokenizer, a vision-language action model trained on interaction data, and a physics-guided refinement module, advancing open-vocabulary dexterous manipulation.

Findings

01. Achieves state-of-the-art results on multiple datasets.

02. Demonstrates strong generalization to unseen objects and trajectories.

03. Produces smooth, physically feasible manipulation sequences.

Abstract

Planning physically feasible dexterous hand manipulation is a central challenge in robotic manipulation and Embodied AI. Prior work typically relies on object-centric cues or precise hand-object interaction sequences, foregoing the rich, compositional guidance of open-vocabulary instruction. We introduce UniHM, the first framework for unified dexterous hand manipulation guided by free-form language commands. We propose a Unified Hand-Dexterous Tokenizer that maps heterogeneous dexterous-hand morphologies into a single shared codebook, improving cross-dexterous hand generalization and scalability to new morphologies. Our vision language action model is trained solely on human-object interaction data, eliminating the need for massive real-world teleoperation datasets, and demonstrates strong generalizability in producing human-like manipulation sequences from open-ended language…
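The shared-codebook idea in the abstract can be illustrated with a small vector-quantization sketch. This is not the authors' implementation: it assumes each hand has its own encoder into a common latent space and that a pose is tokenized by nearest-neighbour lookup in one codebook. The hand names, joint counts, dimensions, and random weights below are all hypothetical stand-ins for learned parameters.

```python
import random


def make_matrix(rows, cols, rng):
    """Random matrix as a list of rows (stand-in for learned weights)."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def project(matrix, vec):
    """Matrix-vector product: maps a hand-specific pose into the shared latent space."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def quantize(latent, codebook):
    """Return the index of the nearest codebook entry (squared Euclidean distance)."""
    def sq_dist(code):
        return sum((a - b) ** 2 for a, b in zip(latent, code))
    return min(range(len(codebook)), key=lambda i: sq_dist(codebook[i]))


rng = random.Random(0)
LATENT_DIM = 8       # assumed size of the shared latent space
CODEBOOK_SIZE = 16   # assumed number of discrete hand tokens

# Hypothetical hands with different joint counts, chosen only for illustration.
HAND_DOF = {"hand_a": 16, "hand_b": 22}

# One codebook shared by every hand; one encoder per hand morphology.
codebook = make_matrix(CODEBOOK_SIZE, LATENT_DIM, rng)
encoders = {name: make_matrix(LATENT_DIM, dof, rng) for name, dof in HAND_DOF.items()}


def tokenize(hand, pose):
    """Encode a pose for the given hand and return its shared-codebook token id."""
    return quantize(project(encoders[hand], pose), codebook)


# Poses of different dimensionality end up in the same token vocabulary.
tokens = {
    name: tokenize(name, [rng.uniform(-1.0, 1.0) for _ in range(dof)])
    for name, dof in HAND_DOF.items()
}
print(tokens)
```

Because every hand maps into the same token vocabulary, a downstream sequence model would only need to predict token ids. Under this assumption, supporting a new hand means adding an encoder (and a matching decoder, omitted here) rather than retraining the whole model.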

Peer Reviews

Decision: ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 4

Strengths

- First VLM for unified (multi-embodiment) dexterous manipulation
- The VLM can be trained purely from human data
- Performance seems to be competitive
- Real robot experiments are convincing

Weaknesses

- Some sections are unclear and missing detail (see questions)
- The paper would benefit from further ablations, e.g. of the different parts of the cost function or of the benefits of using a unified latent space. Here, the (maybe naive) alternative would be to learn everything in a single space (e.g. MANO) and retarget the output of the VLM afterwards. Such a comparison would be insightful.
- There is no further information on the costs of the optimization of the grasp. How long does it take? Is it

Reviewer 02 · Rating 8 · Confidence 2

Strengths

The paper considers a relevant problem: generalization to different hand morphologies. The authors propose a comprehensive framework that shows good results across multiple metrics and in real-world robot setups. They successfully combine and benefit from existing models, such as PointSam and CLIPort, and integrate them into their framework. Overall, the approach is well-defined, and the evaluation supports the claims.

Weaknesses

The main weakness of the approach is the complexity of the method. Currently, the approach consists of multiple parts and many different pre-trained models that need to be finetuned. The authors do not provide code. It is unclear which simulation environment they have used. The authors mention some terms without providing sufficient explanation in their context or a reference to related work. Such terms are MANO poses (line 153), vector-quantization operator (line 187), knowledge distillatio

Reviewer 03 · Rating 2 · Confidence 3

Strengths

The paper proposes a strong framework that incorporates a cross-dexterous-hand representation, language-conditioned sequence generation, and a physics-guided dynamic trajectory refinement module. The authors perform extensive experiments across multiple datasets, covering both seen and unseen settings, with comprehensive evaluations and ablations.

Weaknesses

Tokenizer evaluation limited to a single hand. Although a unified dexterous-hand tokenizer is proposed, both the HOI and real-world experiments appear to use only one hand type. A broader evaluation across multiple robot hands would more convincingly validate the tokenizer’s generality and cross-hand transfer.

Underspecified sequence-generation metrics. The paper introduces a manipulation sequence generator, but the evaluation protocol for sequences is insufficiently detailed. Please clearly de

Taxonomy

Topics: Robot Manipulation and Learning · Multimodal Machine Learning Applications · Human Pose and Action Recognition