Towards Egocentric 3D Hand Pose Estimation in Unseen Domains
Wiktor Mucha, Michael Wray, Martin Kampel

TL;DR
This paper introduces V-HPOT, a camera-agnostic, self-supervised approach for egocentric 3D hand pose estimation that significantly improves cross-domain performance without extensive training data.
Contribution
V-HPOT's key innovation is estimating normalized keypoint depth and applying self-supervised test-time optimization, enabling robust cross-domain hand pose estimation.
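A minimal sketch of why normalising depth by focal length and image size can make it camera-agnostic. The exact formula used by V-HPOT is not given in this summary; the form `z * image_size / focal` and the function names below are assumptions chosen for illustration only.

```python
# Illustrative only: an ASSUMED normalisation, not the paper's confirmed formula.

def to_virtual_depth(z_metric, focal_px, image_size_px):
    """Map metric depth to a value that no longer depends on the camera.

    Under a pinhole model, apparent size in pixels scales with focal / z,
    so z * image_size / focal is constant for a hand that covers the same
    fraction of the image, whichever camera captured it.
    """
    return z_metric * image_size_px / focal_px


def from_virtual_depth(z_virtual, focal_px, image_size_px):
    """Invert the mapping once the real camera intrinsics are known."""
    return z_virtual * focal_px / image_size_px


# Two hypothetical cameras: B has twice the focal length of A, and the hand
# is twice as far away, so it covers the same fraction of both images.
z_a = to_virtual_depth(0.5, focal_px=500.0, image_size_px=500.0)
z_b = to_virtual_depth(1.0, focal_px=1000.0, image_size_px=500.0)

# Recover metric depth for camera B from the shared virtual depth.
z_b_metric = from_virtual_depth(z_a, focal_px=1000.0, image_size_px=500.0)
```

Both cameras map to the same virtual depth, so a network trained to predict that value does not need to memorise any one camera's intrinsics; metric depth is recovered afterwards from the known focal length and image size.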
Findings
Achieves a 71% reduction in mean pose error on the H2O dataset.
Achieves a 41% reduction in mean pose error on the AssemblyHands dataset.
Outperforms all single-stage methods and rivals two-stage approaches while using less data.
Abstract
We present V-HPOT, a novel approach for improving the cross-domain performance of 3D hand pose estimation from egocentric images in diverse, unseen domains. State-of-the-art methods perform strongly when trained and tested within the same domain. However, they struggle to generalise to new environments because of limited training data and because their depth perception overfits to specific camera intrinsics. Our method addresses this by estimating keypoint z-coordinates in a virtual camera space, normalised by focal length and image size, enabling camera-agnostic depth prediction. We further leverage this invariance to camera intrinsics to propose a self-supervised test-time optimisation strategy that refines the model's depth perception during inference. This is achieved by applying a 3D consistency loss between predicted and in-space scale-transformed hand poses, allowing the…
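The abstract is cut off before it specifies the loss, so the following is only a sketch of what a 3D consistency loss under a scale transform could look like. It assumes that zooming the image by a factor `s` should leave a keypoint's x and y unchanged and divide its depth by `s` (a pinhole-camera argument); the function names, the transform, and the mean-Euclidean-distance form are all hypothetical, not taken from the paper.

```python
import math

# Illustrative only: an ASSUMED form of the consistency loss.

def scale_transform_pose(pose, s):
    """Pose expected after zooming the image by s: same x and y, depth / s."""
    return [(x, y, z / s) for x, y, z in pose]


def consistency_loss(pose_from_scaled_image, pose_from_original, s):
    """Mean Euclidean distance between the pose predicted on the zoomed
    image and the scale-transformed pose predicted on the original image.
    No ground-truth labels are needed, so it can be minimised at test time.
    """
    expected = scale_transform_pose(pose_from_original, s)
    dists = [math.dist(p, q) for p, q in zip(pose_from_scaled_image, expected)]
    return sum(dists) / len(dists)


# Two hypothetical keypoints predicted on the original image (metres).
pose_orig = [(0.00, 0.00, 0.60), (0.03, 0.01, 0.62)]
s = 1.2

# A model whose depth reacts correctly to the zoom incurs zero loss ...
loss_consistent = consistency_loss(scale_transform_pose(pose_orig, s), pose_orig, s)
# ... while a model that ignores the zoom is penalised.
loss_ignores_zoom = consistency_loss(pose_orig, pose_orig, s)
```

In a test-time optimisation loop, a loss of this kind would be back-propagated through the network on unlabelled target-domain frames; the sketch above only shows how the scalar is computed.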
Taxonomy
Topics: Human Pose and Action Recognition · Robot Manipulation and Learning · Hand Gesture Recognition Systems
