VPHO: Joint Visual-Physical Cue Learning and Aggregation for Hand-Object Pose Estimation
Jun Zhou, Chi Xu, Kaifeng Tang, Yuting Ge, Tingrui Guo, Li Cheng

TL;DR
This paper introduces VPHO, a framework that jointly learns visual and physical cues for more accurate and physically plausible 3D hand-object pose estimation from a single RGB image, improving over prior methods.
Contribution
The paper proposes a novel joint learning and aggregation approach that integrates visual and physical cues for enhanced hand-object pose estimation.
Findings
Outperforms state-of-the-art methods in pose accuracy
Produces more physically plausible results
Refines candidate poses effectively
Abstract
Estimating the 3D poses of hands and objects from a single RGB image is a fundamental yet challenging problem, with broad applications in augmented reality and human-computer interaction. Existing methods largely rely on visual cues alone, often producing results that violate physical constraints such as interpenetration or non-contact. Recent efforts to incorporate physics reasoning typically depend on post-optimization or non-differentiable physics engines, which compromise visual consistency and end-to-end trainability. To overcome these limitations, we propose a novel framework that jointly integrates visual and physical cues for hand-object pose estimation. This integration is achieved through two key ideas: 1) joint visual-physical cue learning: The model is trained to extract 2D visual cues and 3D physical cues, thereby enabling more comprehensive representation learning for…
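To make the two physical violations named in the abstract concrete, here is a minimal sketch that measures interpenetration and non-contact for a toy scene. It is not the paper's method: the object is assumed to be a sphere with an analytic signed distance, and the names `sphere_sdf`, `physical_violations`, and the 5 mm `contact_tol` are hypothetical choices made only for illustration.

```python
import math

def sphere_sdf(point, center, radius):
    """Signed distance from a point to a sphere surface (negative inside)."""
    return math.dist(point, center) - radius

def physical_violations(hand_points, center, radius, contact_tol=0.005):
    """Return (max penetration depth, in_contact) for hand points vs. a sphere.

    Assumption: the object is a sphere; a real pipeline would query the
    signed distance of the actual object mesh instead.
    """
    distances = [sphere_sdf(p, center, radius) for p in hand_points]
    closest = min(distances)
    # Interpenetration: how deep the deepest hand point sits inside the object.
    max_penetration = max(0.0, -closest)
    # Contact: at least one hand point lies within contact_tol of the surface.
    in_contact = closest <= contact_tol
    return max_penetration, in_contact

# Toy scene in metres: a 5 cm sphere at the origin.
center, radius = (0.0, 0.0, 0.0), 0.05
floating = [(0.10, 0.0, 0.0), (0.0, 0.12, 0.0)]     # hand hovers, no contact
penetrating = [(0.03, 0.0, 0.0), (0.0, 0.06, 0.0)]  # one point is inside

print(physical_violations(floating, center, radius))
print(physical_violations(penetrating, center, radius))
```

A pose estimate that is visually consistent can still fail either check, which is the gap between visual and physical cues that the abstract describes.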
Taxonomy
Topics: Human Pose and Action Recognition · Robot Manipulation and Learning · Hand Gesture Recognition Systems
