SiMHand: Mining Similar Hands for Large-Scale 3D Hand Pose Pre-training
Nie Lin, Takehiko Ohkawa, Yifei Huang, Mingfang Zhang, Minjie Cai, Ming Li, Ryosuke Furuta, Yoichi Sato

TL;DR
SiMHand introduces a contrastive pre-training framework for 3D hand pose estimation using a large-scale dataset of in-the-wild hand images, improving accuracy across multiple benchmarks.
Contribution
The paper proposes a novel contrastive learning method that leverages similar hand pairs from diverse in-the-wild images for large-scale pre-training of 3D hand pose models.
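A minimal sketch of what a contrastive objective over mined pairs could look like, assuming an InfoNCE-style loss in which each anchor's positive is its mined similar hand and the other positives in the batch act as negatives. The function name `weighted_info_nce`, the `weights` argument, and the temperature value are illustrative assumptions, not the paper's formulation; the per-pair weight only gestures at the weighting mechanism the reviewers mention.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(v):
    """Scale a vector to unit length (a zero vector is left unchanged)."""
    norm = math.sqrt(dot(v, v)) or 1.0
    return [x / norm for x in v]


def weighted_info_nce(anchors, positives, weights=None, temperature=0.1):
    """InfoNCE-style loss where positives[i] is the mined partner of anchors[i].

    All other positives in the batch serve as negatives for anchor i.
    `weights` is a hypothetical per-pair weight (assumption, not the
    paper's definition); it must not sum to zero.
    """
    a = [l2_normalize(v) for v in anchors]
    p = [l2_normalize(v) for v in positives]
    n = len(a)
    if weights is None:
        weights = [1.0] * n

    total = 0.0
    for i in range(n):
        # Cosine similarity of anchor i to every candidate, scaled by temperature.
        logits = [dot(a[i], p[j]) / temperature for j in range(n)]
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        # Negative log-probability of picking the true partner.
        total += weights[i] * (log_denom - logits[i])
    return total / sum(weights)


if __name__ == "__main__":
    anchors = [[1.0, 0.0], [0.0, 1.0]]
    matched = [[1.0, 0.0], [0.0, 1.0]]
    mismatched = [[0.0, 1.0], [1.0, 0.0]]
    print(weighted_info_nce(anchors, matched))
    print(weighted_info_nce(anchors, mismatched))
```

In this toy batch the loss is near zero when each anchor is paired with its aligned partner and large when the partners are swapped, which is the behaviour a contrastive objective is meant to have.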
Findings
Outperforms existing contrastive learning methods in hand pose tasks.
Achieves a 15% improvement on the FreiHand dataset.
Demonstrates significant gains on DexYCB and AssemblyHands datasets.
Abstract
We present a framework for pre-training of 3D hand pose estimation from in-the-wild hand images sharing similar hand characteristics, dubbed SiMHand. Pre-training with large-scale images achieves promising results in various tasks, but prior methods for 3D hand pose pre-training have not fully utilized the potential of diverse hand images accessible from in-the-wild videos. To facilitate scalable pre-training, we first prepare an extensive pool of hand images from in-the-wild videos and design our pre-training method with contrastive learning. Specifically, we collect over 2.0M hand images from recent human-centric videos, such as 100DOH and Ego4D. To extract discriminative information from these images, we focus on the similarity of hands: pairs of non-identical samples with similar hand poses. We then propose a novel contrastive learning method that embeds similar hand pairs…
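The idea of pairing non-identical samples with similar hand poses can be illustrated with a brute-force nearest-neighbour search. This is a sketch under stated assumptions: each hand is represented by a list of 2D keypoints, poses are compared after removing translation and scale, and the optional `video_ids` filter (which forces partners to come from a different video) is hypothetical. The names `normalize_pose` and `mine_similar_pairs` are illustrative and do not come from the paper.

```python
import math


def normalize_pose(keypoints):
    """Center 2D keypoints on their mean and scale them to unit norm.

    Returns a flat feature vector that ignores translation and scale.
    """
    n = len(keypoints)
    cx = sum(x for x, _ in keypoints) / n
    cy = sum(y for _, y in keypoints) / n
    centered = [(x - cx, y - cy) for x, y in keypoints]
    scale = math.sqrt(sum(x * x + y * y for x, y in centered)) or 1.0
    return [c / scale for point in centered for c in point]


def mine_similar_pairs(poses, video_ids=None):
    """For each sample, find the closest other sample in pose space.

    Returns a list of (index, partner_index, distance) tuples. If
    `video_ids` is given (assumption), partners from the same video
    are skipped; partner_index is None when no candidate remains.
    """
    feats = [normalize_pose(p) for p in poses]
    pairs = []
    for i, fi in enumerate(feats):
        best_j, best_d = None, math.inf
        for j, fj in enumerate(feats):
            if j == i:
                continue  # a sample is never its own partner
            if video_ids is not None and video_ids[i] == video_ids[j]:
                continue  # optionally require a different source video
            d = math.dist(fi, fj)
            if d < best_d:
                best_j, best_d = j, d
        pairs.append((i, best_j, best_d))
    return pairs


if __name__ == "__main__":
    pose_a = [(0, 0), (1, 0), (0, 1)]
    pose_b = [(10, 10), (12, 10), (10, 12)]  # pose_a shifted and scaled
    pose_c = [(0, 0), (1, 0), (2, 0)]        # a different configuration
    for pair in mine_similar_pairs([pose_a, pose_b, pose_c]):
        print(pair)
```

The double loop is quadratic in the number of samples, so it only serves as an illustration; at the scale of millions of images an approximate nearest-neighbour index would be the more realistic choice.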
Peer Reviews
Decision: ICLR 2025 Poster
- The paper is well written and easy to follow.
- The motivation of finding similar hands derived from different video domains is technically sound, which can further benefit contrastive learning process from discriminating foreground hands in varying backgrounds.
- The experimental results in Table 3 demonstrate the generality of the proposed contrastive learning with adaptive weighting mechanism.
- TempCLR [1] proposes a pre-train framework for 3D hand reconstruction with time-coherent contrastive learning, and shows better performance compared with PeCLR. Although TempCLR focuses on reconstruction tasks, the used parametric model can output 3D pose results. Therefore, more comparisons with TempCLR would be helpful.
- In the second column of Figure 6, HandCLR demonstrates advanced performance in hand-object occlusion. Does the proposed method exhibit robustness in similar severe occlusi
- The paper is well-written and easy to follow.
- The motivation behind the proposed method is sound, with comprehensive details from data preparation to training.
- The design of contrastive loss with weighting provides better gradient guidance for samples with different sources and similarities, which is both reasonable and effective.
- The numerous experiments reflect significant effort by the authors.
- The experimental section is logical and thorough, demonstrating performance improvements
- Some presentation issues need improvement
- Figure 6 should be updated to remove inappropriate "bbox" spelling marks. Additionally, all images in the paper should be replaced with vector versions to prevent blurry text, as seen in Figure 3.
- The article lacks references and discussions on self-supervised methods. The recent two works, S2Hand and HaMuCo, although not pre-training methods, also attempt to use unlabeled images and 2D off-the-shelf detectors to train 3D hand pose estimation m
- The paper empirically verifies the improvement over prior self-supervised models. Self-supervision is a rather underexplored area in hand pose estimation and can lead to potentially great benefit as foundation models.
- The improvements are substantial.
- The paper is easy to understand.
- The method shows great improvement over prior self-supervised methods through the use of noisy 2D annotations. However, its use is a rather involved process: it first needs to be embedded, then used during pre-training before then performing supervised fine-tuning. Instead, why not just use the noisy 2D annotations directly as a form of weak-supervision? In fact, this has been done in prior work [1] and has led to substantial improvements. In order to properly verify the usefulness of the aut
Taxonomy
Topics: Hand Gesture Recognition Systems · Human Pose and Action Recognition · Human Motion and Animation
Methods: Contrastive Learning · Focus
