UniHM: Unified Dexterous Hand Manipulation with Vision Language Model
Zhenhao Zhang, Jiaxin Liu, Ye Shi, Jingya Wang

TL;DR
UniHM is a novel framework enabling dexterous hand manipulation guided by open-vocabulary language commands, using a unified tokenizer, language-conditioned models, and physics-based refinement for realistic, generalizable robotic manipulation.
Contribution
It introduces a unified hand tokenizer, a vision-language action model trained on interaction data, and a physics-guided refinement module, advancing open-vocabulary dexterous manipulation.
Findings
Achieves state-of-the-art results on multiple datasets.
Demonstrates strong generalization to unseen objects and trajectories.
Produces smooth, physically feasible manipulation sequences.
Abstract
Planning physically feasible dexterous hand manipulation is a central challenge in robotic manipulation and Embodied AI. Prior work typically relies on object-centric cues or precise hand-object interaction sequences, foregoing the rich, compositional guidance of open-vocabulary instruction. We introduce UniHM, the first framework for unified dexterous hand manipulation guided by free-form language commands. We propose a Unified Hand-Dexterous Tokenizer that maps heterogeneous dexterous-hand morphologies into a single shared codebook, improving cross-dexterous hand generalization and scalability to new morphologies. Our vision language action model is trained solely on human-object interaction data, eliminating the need for massive real-world teleoperation datasets, and demonstrates strong generalizability in producing human-like manipulation sequences from open-ended language…
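The shared-codebook idea in the abstract can be illustrated with a small sketch. This is not the authors' tokenizer: the hand names, joint counts, latent size, codebook size, and the random linear "encoders" below are all hypothetical stand-ins for learned components. It only shows how poses of different dimensionality might be projected into one latent space and then vector-quantized to the index of the nearest entry in a single codebook.

```python
import random

LATENT_DIM = 8       # assumed shared latent size (hypothetical)
CODEBOOK_SIZE = 16   # assumed number of discrete hand tokens (hypothetical)
HAND_DOF = {"hand_a": 16, "hand_b": 24}  # hypothetical hands with different joint counts

rng = random.Random(0)


def make_projection(in_dim, out_dim):
    """Random linear map standing in for a learned per-hand encoder."""
    return [[rng.gauss(0.0, 1.0) for _ in range(in_dim)] for _ in range(out_dim)]


def project(matrix, pose):
    """Apply the linear map to a joint-angle vector."""
    return [sum(w * x for w, x in zip(row, pose)) for row in matrix]


def quantize(latent, codebook):
    """Vector quantization: index of the nearest code (squared Euclidean distance)."""
    def sq_dist(code):
        return sum((a - b) ** 2 for a, b in zip(latent, code))
    return min(range(len(codebook)), key=lambda i: sq_dist(codebook[i]))


# One codebook shared by every hand morphology.
codebook = [[rng.gauss(0.0, 1.0) for _ in range(LATENT_DIM)]
            for _ in range(CODEBOOK_SIZE)]

# One encoder per morphology, all mapping into the same latent space.
encoders = {name: make_projection(dof, LATENT_DIM) for name, dof in HAND_DOF.items()}


def tokenize(hand, pose):
    """Map a pose of a given hand to a token index in the shared codebook."""
    if len(pose) != HAND_DOF[hand]:
        raise ValueError(f"expected {HAND_DOF[hand]} joint values for {hand}")
    return quantize(project(encoders[hand], pose), codebook)


poses = {name: [rng.uniform(-1.0, 1.0) for _ in range(dof)]
         for name, dof in HAND_DOF.items()}
tokens = {name: tokenize(name, pose) for name, pose in poses.items()}
print(tokens)  # both hands yield indices into the same token vocabulary
```

In this sketch, supporting a new morphology would mean adding only another encoder entry while the codebook stays fixed, which is one plausible reading of the scalability claim in the abstract.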
Peer Reviews
Decision · ICLR 2026 Poster
- First VLM for unified (multi-embodiment) dexterous manipulation
- The VLM can be trained purely from human data
- Performance seems to be competitive
- Real robot experiments are convincing
- Some sections are unclear and missing detail (see questions)
- The paper would benefit from further ablations. E.g. the different parts of the cost function. Or the benefits of using a unified latent space. Here, the (maybe naive) alternative would be to learn everything in a single space (e.g. MANO) and retarget the output of the VLM afterwards. Such a comparison would be insightful.
- There is no further information on the costs of the optimization of the grasp. How long does it take? Is it
The paper is considering a relevant problem of generalization to different hand morphologies. The authors proposed a comprehensive framework that shows good results across multiple metrics and in real-world robot setups. They successfully manage to combine and benefit from existing models, such as PointSam and CLIPort, and integrate them into their framework. Overall, the approach is well-defined, and evaluation supports the claims.
The main weakness of the approach is the complexity of the method. Currently, the approach consists of multiple parts and many different pre-trained models that need to be finetuned. The authors do not provide code. It is unclear which simulation environment they have used. The authors mention some terms without providing sufficient explanation in their context or a reference to related work. Such terms are MANO poses (line 153), vector-quantization operator (line 187), knowledge distillatio
The paper proposes a strong framework that incorporates a cross-dexterous-hand representation, language-conditioned sequence generation, and a physics-guided dynamic trajectory refinement module. The authors perform extensive experiments across multiple datasets, covering both seen and unseen settings, with comprehensive evaluations and ablations.
Tokenizer evaluation limited to a single hand. Although a unified dexterous-hand tokenizer is proposed, both the HOI and real-world experiments appear to use only one hand type. A broader evaluation across multiple robot hands would more convincingly validate the tokenizer’s generality and cross-hand transfer. Underspecified sequence-generation metrics. The paper introduces a manipulation sequence generator, but the evaluation protocol for sequences is insufficiently detailed. Please clearly de
Taxonomy
Topics: Robot Manipulation and Learning · Multimodal Machine Learning Applications · Human Pose and Action Recognition
