CLIP-Hand3D: Exploiting 3D Hand Pose Estimation via Context-Aware Prompting
Shaoxiang Guo, Qing Cai, Lin Qi, Junyu Dong

TL;DR
This paper introduces CLIP-Hand3D, a novel 3D hand pose estimation method that leverages CLIP's contrastive learning to connect text prompts with detailed 3D hand joint features from monocular images, achieving state-of-the-art results.
Contribution
It proposes a new approach that bridges text prompts with 3D hand pose features using contrastive learning, enabling faster and more accurate hand pose estimation.
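As a rough illustration of what pairing pose features with text prompts through contrastive learning can look like, here is a minimal sketch of a generic CLIP-style symmetric contrastive loss in plain Python. The page does not give the paper's actual loss, so this is an assumption rather than the authors' formulation. The function names, the temperature value, and the toy feature vectors are all hypothetical.

```python
import math
from typing import List, Sequence


def l2_normalize(v: Sequence[float]) -> List[float]:
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cross_entropy_diag(logits: List[List[float]]) -> float:
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the row maximum for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def clip_style_loss(pose_feats, text_feats, temperature: float = 0.07) -> float:
    """Symmetric contrastive loss over a batch of matched (pose, text) pairs.

    The i-th pose feature is assumed to match the i-th text feature.
    The temperature is an illustrative value, not taken from the paper.
    """
    p = [l2_normalize(v) for v in pose_feats]
    t = [l2_normalize(v) for v in text_feats]
    # Pairwise cosine similarities, scaled by the temperature.
    logits = [[sum(a * b for a, b in zip(pi, tj)) / temperature for tj in t] for pi in p]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-pose direction
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits_t))


# Toy batch: aligned pairs should score a lower loss than swapped pairs.
aligned = clip_style_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = clip_style_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

In this sketch the loss is small when each pose feature is most similar to its own text feature and large when the pairs are mismatched, which is the behaviour a contrastive objective is meant to reward.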
Findings
Achieves state-of-the-art performance on public benchmarks.
Provides significantly faster inference speed.
Effectively encodes pose-aware features from spatial joint distributions.
Abstract
Contrastive Language-Image Pre-training (CLIP) has begun to emerge in many computer vision tasks and has achieved promising performance. However, it remains underexplored whether CLIP can be generalized to 3D hand pose estimation, as bridging text prompts with pose-aware features presents significant challenges due to the discrete nature of joint positions in 3D space. In this paper, we make one of the first attempts to propose a novel 3D hand pose estimator from monocular images, dubbed CLIP-Hand3D, which successfully bridges the gap between text prompts and irregular detailed pose distribution. In particular, the distribution order of hand joints in various 3D space directions is derived from pose labels, forming corresponding text prompts that are subsequently encoded into text representations. Simultaneously, 21 hand joints in the 3D space are retrieved, and their spatial…
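The abstract describes deriving the distribution order of hand joints along each 3D direction from pose labels and turning that order into text prompts. A minimal sketch of that idea follows, under several assumptions: the joint names, the axis labels, and the prompt wording are hypothetical and are not taken from the paper, which may use a different template and joint indexing.

```python
from typing import List, Sequence, Tuple

# Hypothetical names for 21 joints: the wrist plus four joints per finger.
FINGERS = ["thumb", "index", "middle", "ring", "pinky"]
JOINT_NAMES = ["wrist"] + [f"{finger}_{k}" for finger in FINGERS for k in range(1, 5)]

# Assumed mapping from a direction label to a coordinate index.
AXES = {"x": 0, "y": 1, "z": 2}

Joint = Tuple[float, float, float]


def joint_order(joints: Sequence[Joint], axis: str) -> List[int]:
    """Return joint indices sorted by their coordinate along the given axis."""
    idx = AXES[axis]
    # sorted() is stable, so joints with equal coordinates keep their index order.
    return sorted(range(len(joints)), key=lambda j: joints[j][idx])


def order_prompt(joints: Sequence[Joint], axis: str) -> str:
    """Build an illustrative text prompt describing the joint order along one axis."""
    listed = ", ".join(JOINT_NAMES[j] for j in joint_order(joints, axis))
    return f"Hand joints ordered along the {axis} axis from smallest to largest: {listed}."


# Toy pose label: x increases with the joint index, y decreases, z is constant.
toy_joints = [(float(i), float(20 - i), 0.0) for i in range(21)]
prompts = {axis: order_prompt(toy_joints, axis) for axis in AXES}
```

Each prompt string would then be passed to a text encoder. That step is omitted here because the page gives no details about it.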
Taxonomy
Topics: Human Pose and Action Recognition · Advanced Neural Network Applications · Multimodal Machine Learning Applications
Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Contrastive Learning · Contrastive Language-Image Pre-training
