UniHand: A Unified Model for Diverse Controlled 4D Hand Motion Modeling
Zhihao Sun, Tong Wu, Ruirui Tu, Daoguo Dong, Zuxuan Wu

TL;DR
UniHand is a unified diffusion-based framework that models 4D hand motion by integrating estimation and generation tasks, effectively handling occlusions and incomplete data through a shared latent space.
Contribution
It introduces a joint variational autoencoder and a latent diffusion model to unify hand motion estimation and generation from heterogeneous conditions.
Findings
Robust performance under severe occlusions.
Accurate motion synthesis from incomplete inputs.
Effective integration of diverse condition signals.
Abstract
Hand motion plays a central role in human interaction, yet modeling realistic 4D hand motion (i.e., 3D hand pose sequences over time) remains challenging. Research in this area is typically divided into two tasks: (1) Estimation approaches reconstruct precise motion from visual observations, but often fail under hand occlusion or absence; (2) Generation approaches focus on synthesizing hand poses by exploiting generative priors under multi-modal structured inputs and infilling motion from incomplete sequences. However, this separation not only limits the effective use of heterogeneous condition signals that frequently arise in practice, but also prevents knowledge transfer between the two tasks. We present UniHand, a unified diffusion-based framework that formulates both estimation and generation as conditional motion synthesis. UniHand integrates heterogeneous inputs by embedding…
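The abstract frames both estimation and generation as conditional motion synthesis. The sketch below is an illustration only and not the authors' implementation. It shows one way a single sampler could serve both tasks: fully observed frames correspond to an estimation-style setting, while partially missing frames correspond to infilling. The availability mask, the function names (`build_condition`, `toy_denoiser`, `sample_latent`), the feature sizes, and the linear noise schedule are all assumptions made for this example. The denoiser is a placeholder that predicts zero noise, where a trained network would normally go.

```python
import numpy as np


def build_condition(observations, available):
    """Pack per-frame observations and an availability mask into one array.

    observations: (T, D) per-frame features (hypothetical, e.g. 2D keypoints).
    available:    (T,) boolean array; False marks occluded or missing frames.
    """
    mask = available.astype(observations.dtype)[:, None]   # (T, 1)
    masked_obs = observations * mask                        # zero out missing frames
    return np.concatenate([masked_obs, mask], axis=1)       # (T, D + 1)


def toy_denoiser(z_t, t, cond):
    """Placeholder for a learned noise predictor (assumption: always predicts zero)."""
    return np.zeros_like(z_t)


def sample_latent(cond, latent_dim, steps=10, seed=0):
    """Standard DDPM-style ancestral sampling over a per-frame latent sequence."""
    rng = np.random.default_rng(seed)
    betas = np.linspace(1e-4, 0.02, steps)    # assumed linear noise schedule
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    z = rng.standard_normal((cond.shape[0], latent_dim))
    for t in reversed(range(steps)):
        eps = toy_denoiser(z, t, cond)
        # Posterior mean of the reverse step given the predicted noise.
        z = (z - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        if t > 0:
            z = z + np.sqrt(betas[t]) * rng.standard_normal(z.shape)
    return z


# Estimation-style input: every frame is observed.
obs = np.ones((8, 4))
cond_full = build_condition(obs, np.ones(8, dtype=bool))

# Infilling-style input: the last three frames are missing.
cond_partial = build_condition(obs, np.array([True] * 5 + [False] * 3))

# The same sampler is called in both cases; only the condition changes.
z_full = sample_latent(cond_full, latent_dim=16)
z_partial = sample_latent(cond_partial, latent_dim=16)
```

In a full system the sampled latents would be decoded back to hand motion by a decoder such as the joint VAE the paper describes; that step is omitted here because the source does not give its details.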
Peer Reviews
Decision: ICLR 2026 Poster
The proposed method shines when evaluated against numerous baselines, even in the presence of significant occlusion (Table 1). It is able to handle multimodal conditioning input, making it flexible and able to benefit from various types of known information at inference time. The proposed method is thoroughly ablated with respect to its components and possible input modalities. The work includes an honest discussion of its limitations.
The submission could benefit from more qualitative examples in the supplementary material. This is especially relevant for generative models.
This paper is well motivated: it aims to bring estimation and generation together in one model. The proposed model can indeed perform these two tasks in a unified way. The paper does not propose a new problem formulation, but the effort to introduce a unified solution for two problems is interesting and valuable. The writing is generally clear: the reader can understand, at a high level, how the proposed framework achieves its goal.
The main issue with this paper is that it does not address its stated goal: unifying both estimation *and generation*. The generation ability of the framework is not tested or reported. The paper is motivated by unifying estimation and generation in the same framework. The motivation is sound, and the denoiser in the framework is indeed a generative model. However, all results are reported **only on the hand pose estimation task**. The sole focus on estimation deviates from what is described in
- The paper proposes a unified diffusion-based framework that integrates both estimation and generation for 4D hand motion modeling, offering a fresh formulation of conditional motion synthesis that extends beyond task-specific designs.
- The technical design, including the Joint VAE and Hand Perceptron modules, is well-motivated and validated through comprehensive experiments on multiple datasets, showing robustness under occlusion and dynamic camera motion.
- The paper is clearly written and s
- Real-world deployment may be limited without efficient preprocessing of input modalities.
- Heavy computational and data requirements for training.
Taxonomy
Topics: Human Pose and Action Recognition · Robot Manipulation and Learning · Human Motion and Animation
