CLIP-Hand3D: Exploiting 3D Hand Pose Estimation via Context-Aware Prompting

Shaoxiang Guo; Qing Cai; Lin Qi; Junyu Dong

arXiv:2309.16140·cs.MM·September 29, 2023·1 cites

TL;DR

This paper introduces CLIP-Hand3D, a novel 3D hand pose estimation method that leverages CLIP's contrastive learning to connect text prompts with detailed 3D hand joint features from monocular images, achieving state-of-the-art results.

Contribution

It proposes a new approach that bridges text prompts with 3D hand pose features using contrastive learning, enabling faster and more accurate hand pose estimation.
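
As a rough illustration of what bridging text prompts and pose features with contrastive learning can look like, here is a minimal sketch of a CLIP-style symmetric contrastive loss in plain Python. It is not taken from the paper: the function names, the temperature value of 0.07, and the toy two-dimensional features are all assumptions made for the example, and the paper's actual loss may differ.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_sum_exp = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)

def clip_style_loss(pose_feats, text_feats, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired (pose, text) features.

    pose_feats[i] and text_feats[i] are assumed to describe the same sample.
    The temperature is a hypothetical default, not a value from the paper.
    """
    pose = [l2_normalize(p) for p in pose_feats]
    text = [l2_normalize(t) for t in text_feats]
    # Cosine similarity between every pose and every text, scaled by temperature.
    logits = [
        [sum(a * b for a, b in zip(p, t)) / temperature for t in text]
        for p in pose
    ]
    logits_transposed = [list(col) for col in zip(*logits)]
    # Average the pose-to-text and text-to-pose directions.
    return 0.5 * (
        cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits_transposed)
    )

# Toy check: correctly paired features should score a lower loss than swapped ones.
aligned = clip_style_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = clip_style_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned < swapped)
```

The loss pulls each pose feature toward its own text feature and pushes it away from the other texts in the batch, which is the general mechanism that contrastive language-image training relies on.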

Findings

01. Achieves state-of-the-art performance on public benchmarks.

02. Provides significantly faster inference speed.

03. Effectively encodes pose-aware features from spatial joint distributions.

Abstract

Contrastive Language-Image Pre-training (CLIP) has begun to appear in many computer vision tasks and has achieved promising performance. However, whether CLIP can be generalized to 3D hand pose estimation remains underexplored, because bridging text prompts with pose-aware features is challenging given the discrete nature of joint positions in 3D space. In this paper, we make one of the first attempts to address this by proposing a novel 3D hand pose estimator for monocular images, dubbed CLIP-Hand3D, which successfully bridges the gap between text prompts and irregular, detailed pose distributions. In particular, the distribution order of hand joints along the various directions of 3D space is derived from pose labels, forming corresponding text prompts that are subsequently encoded into text representations. Simultaneously, 21 hand joints in the 3D space are retrieved, and their spatial…
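
The abstract's idea of deriving text prompts from the order of joints along each 3D direction can be sketched as follows. This is only an illustration under stated assumptions: the joint names, the prompt wording, and the helper functions `order_along_axis` and `build_prompts` are hypothetical and do not come from the paper, whose actual prompt template is not given in the text above.

```python
# Hypothetical naming for 21 hand joints: one wrist plus four joints per finger.
FINGERS = ["thumb", "index", "middle", "ring", "pinky"]
PARTS = ["mcp", "pip", "dip", "tip"]
JOINT_NAMES = ["wrist"] + [f"{finger}_{part}" for finger in FINGERS for part in PARTS]

AXIS_NAMES = ["x", "y", "z"]

def order_along_axis(joints, axis):
    """Return joint indices sorted by increasing coordinate along one axis."""
    return sorted(range(len(joints)), key=lambda i: joints[i][axis])

def build_prompts(joints):
    """Build one text prompt per axis from a list of 21 (x, y, z) joint labels.

    The sentence template is an assumption made for this example.
    """
    if len(joints) != len(JOINT_NAMES):
        raise ValueError(f"expected {len(JOINT_NAMES)} joints, got {len(joints)}")
    prompts = []
    for axis, axis_name in enumerate(AXIS_NAMES):
        ordered = [JOINT_NAMES[i] for i in order_along_axis(joints, axis)]
        prompts.append(
            f"Along the {axis_name} axis, the hand joints in increasing order are: "
            + ", ".join(ordered)
            + "."
        )
    return prompts

# Toy pose label: x grows with the joint index, y shrinks, z is constant.
toy_joints = [(float(i), float(20 - i), 0.0) for i in range(21)]
prompts = build_prompts(toy_joints)
for prompt in prompts:
    print(prompt)
```

In a CLIP-style pipeline, prompts like these would then be passed through a text encoder to obtain the text representations that are matched against pose features.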

Taxonomy

Topics: Human Pose and Action Recognition · Advanced Neural Network Applications · Multimodal Machine Learning Applications

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Contrastive Learning · Contrastive Language-Image Pre-training