Occlusion-Aware 3D Hand-Object Pose Estimation with Masked AutoEncoders
Hui Yang, Wei Sun, Jian Liu, Jin Zheng, Jian Xiao, Ajmal Mian

TL;DR
This paper introduces HOMAE, an occlusion-aware method for 3D hand-object pose estimation built on masked autoencoders. It handles occlusions by learning context-aware features and by combining an implicit geometric representation (a signed distance field) with an explicit one (a point cloud).
Contribution
The paper proposes a target-focused masking strategy and a fusion of signed distance fields with point clouds to improve occlusion handling in hand-object pose estimation.
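To make the idea of target-focused masking concrete, here is a minimal sketch, not the authors' implementation. It assumes the image is split into a grid of patches and that a hand-object bounding box is available in patch units. The function name `target_focused_mask` and the parameters `mask_ratio` and `focus` are hypothetical, and the values used are illustrative rather than taken from the paper.

```python
import random


def target_focused_mask(grid_h, grid_w, box, mask_ratio=0.75, focus=0.8, seed=0):
    """Return a grid_h x grid_w list of bools; True marks a masked patch.

    box:   (row0, col0, row1, col1), half-open, in patch units. Assumed to lie
           inside the grid and to cover the hand-object region (hypothetical input).
    focus: target fraction of masked patches drawn from inside the box,
           capped by the number of patches the box contains.
    """
    rng = random.Random(seed)
    r0, c0, r1, c1 = box
    inside = [(r, c) for r in range(r0, r1) for c in range(c0, c1)]
    inside_set = set(inside)
    outside = [(r, c) for r in range(grid_h) for c in range(grid_w)
               if (r, c) not in inside_set]

    n_mask = round(mask_ratio * grid_h * grid_w)       # total patches to hide
    n_in = min(len(inside), round(focus * n_mask))     # hidden inside the box
    n_out = min(len(outside), n_mask - n_in)           # remainder from background

    mask = [[False] * grid_w for _ in range(grid_h)]
    for r, c in rng.sample(inside, n_in) + rng.sample(outside, n_out):
        mask[r][c] = True
    return mask


# Example: a 14 x 14 patch grid with the interaction region in the centre.
mask = target_focused_mask(14, 14, box=(4, 4, 10, 10), mask_ratio=0.25, focus=0.6)
```

Compared with uniform random masking, this concentrates the hidden patches on the interaction region, so reconstruction forces the model to infer hand and object structure from the surrounding context.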
Findings
Achieves state-of-the-art results on DexYCB and HO3Dv2 benchmarks.
Effectively models occluded regions by combining global context and local geometry.
Demonstrates robustness in challenging occlusion scenarios.
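The relationship between an implicit representation (a signed distance field) and an explicit one (a point cloud) can be illustrated with a small sketch. This is not the paper's learned SDF or its fusion module. It assumes an analytic sphere as a stand-in shape, and `sphere_sdf` and `surface_points` are hypothetical helper names.

```python
import math


def sphere_sdf(p, center=(0.0, 0.0, 0.0), radius=0.5):
    """Signed distance to a sphere: negative inside, zero on the surface, positive outside."""
    return math.dist(p, center) - radius


def surface_points(sdf, n=21, extent=1.0, eps=0.05):
    """Sample an n^3 grid over [-extent, extent]^3 and keep points near the zero level set.

    The returned list is an explicit point cloud approximating the surface
    that the SDF describes implicitly.
    """
    step = 2 * extent / (n - 1)
    axis = [-extent + i * step for i in range(n)]
    return [(x, y, z)
            for x in axis for y in axis for z in axis
            if abs(sdf((x, y, z))) < eps]


points = surface_points(sphere_sdf)
```

The SDF can be queried at any 3D location, which suits reasoning about regions hidden in the image, while the extracted points give a discrete surface sample that carries local geometric detail.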
Abstract
Hand-object pose estimation from monocular RGB images remains a significant challenge, mainly due to the severe occlusions inherent in hand-object interactions. Existing methods do not sufficiently explore global structural perception and reasoning, which limits their effectiveness in handling occluded hand-object interactions. To address this challenge, we propose an occlusion-aware hand-object pose estimation method based on masked autoencoders, termed HOMAE. Specifically, we propose a target-focused masking strategy that imposes structured occlusion on regions of hand-object interaction, encouraging the model to learn context-aware features and reason about the occluded structures. We further integrate multi-scale features extracted from the decoder to predict a signed distance field (SDF), capturing both global context and fine-grained geometry. To enhance geometric perception, we…
Taxonomy
Topics: Robot Manipulation and Learning · Human Pose and Action Recognition · Hand Gesture Recognition Systems
