GenHOI: Generalized Hand-Object Pose Estimation with Occlusion Awareness
Hui Yang, Wei Sun, Jian Liu, Jian Xiao, Tao Xie, Hossein Rahmani, Ajmal Saeed Mian, Nicu Sebe, Gim Hee Lee

TL;DR
GenHOI is a framework for 3D hand-object pose estimation from a single RGB image. It combines hierarchical semantic prompts, multi-modal masked modeling, and hand priors to handle occlusion and unseen interactions.
Contribution
GenHOI introduces a hierarchical semantic prompt and a multi-modal masked modeling strategy for robust, generalized hand-object pose estimation under occlusion and in unseen scenarios.
Findings
Achieves state-of-the-art results on the DexYCB and HO3Dv2 benchmarks.
Effectively handles occlusion and unseen objects in pose estimation.
Utilizes multi-modal data and hierarchical prompts for improved generalization.
Abstract
Generalized 3D hand-object pose estimation from a single RGB image remains challenging due to the large variations in object appearances and interaction patterns, especially under heavy occlusion. We propose GenHOI, a framework for generalized hand-object pose estimation with occlusion awareness. GenHOI integrates hierarchical semantic knowledge with hand priors to enhance model generalization under challenging occlusion conditions. Specifically, we introduce a hierarchical semantic prompt that encodes object states, hand configurations, and interaction patterns via textual descriptions. This enables the model to learn abstract high-level representations of hand-object interactions for generalization to unseen objects and novel interactions while compensating for missing or ambiguous visual cues. To enable robust occlusion reasoning, we adopt a multi-modal masked modeling strategy over…
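As a rough illustration of the hierarchical semantic prompt described in the abstract, the sketch below assembles three levels of textual description (object state, hand configuration, interaction pattern) into an ordered list of prompt strings. The class name `HOIDescription`, the function `build_hierarchical_prompt`, the level labels, and the example wording are all hypothetical; the abstract does not specify the prompt format or how the text is encoded, so this is a minimal sketch under those assumptions rather than the authors' implementation.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class HOIDescription:
    """Hypothetical container for the three description levels named in the abstract."""
    object_state: str
    hand_configuration: str
    interaction_pattern: str


def build_hierarchical_prompt(desc: HOIDescription) -> List[str]:
    """Return one labeled prompt string per level, ordered object -> hand -> interaction.

    The labels and ordering are assumptions made for illustration only.
    """
    levels = [
        ("object", desc.object_state),
        ("hand", desc.hand_configuration),
        ("interaction", desc.interaction_pattern),
    ]
    # Strip stray whitespace so each level yields a clean "label: text" string.
    return [f"{name}: {text.strip()}" for name, text in levels]


example = HOIDescription(
    object_state="a mug standing upright",
    hand_configuration="right hand with fingers curled",
    interaction_pattern="the hand grasps the mug by its handle",
)
prompts = build_hierarchical_prompt(example)
```

In a full pipeline, each string would presumably be passed to a text encoder and fused with image features; that step is omitted here because the abstract gives no detail on it.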
