VPHO: Joint Visual-Physical Cue Learning and Aggregation for Hand-Object Pose Estimation

Jun Zhou; Chi Xu; Kaifeng Tang; Yuting Ge; Tingrui Guo; Li Cheng

arXiv:2511.12030 · cs.CV · November 18, 2025

TL;DR

This paper introduces VPHO, a framework that jointly learns visual and physical cues to estimate 3D hand-object poses from a single RGB image, yielding poses that are more accurate and more physically plausible than those of prior methods.

Contribution

The paper proposes a novel joint learning and aggregation approach that integrates visual and physical cues for enhanced hand-object pose estimation.

Findings

01 Outperforms state-of-the-art in pose accuracy

02 Produces more physically plausible results

03 Effective candidate pose refinement process

Abstract

Estimating the 3D poses of hands and objects from a single RGB image is a fundamental yet challenging problem, with broad applications in augmented reality and human-computer interaction. Existing methods largely rely on visual cues alone, often producing results that violate physical constraints such as interpenetration or non-contact. Recent efforts to incorporate physics reasoning typically depend on post-optimization or non-differentiable physics engines, which compromise visual consistency and end-to-end trainability. To overcome these limitations, we propose a novel framework that jointly integrates visual and physical cues for hand-object pose estimation. This integration is achieved through two key ideas: 1) joint visual-physical cue learning: The model is trained to extract 2D visual cues and 3D physical cues, thereby enabling more comprehensive representation learning for…
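The two physical violations named in the abstract, interpenetration and non-contact, can be made concrete with a small sketch. This is not the paper's method: it assumes, purely for illustration, that the object is a sphere with a known signed distance function, and the names `sphere_sdf`, `physical_violations`, and `contact_tol` are hypothetical.

```python
import numpy as np

def sphere_sdf(points, center, radius):
    """Signed distance from each point to a sphere surface (negative inside)."""
    return np.linalg.norm(points - center, axis=1) - radius

def physical_violations(hand_points, center, radius, contact_tol=0.005):
    """Return (max penetration depth, whether any hand point touches the object).

    Assumption: the object is a sphere; a real object would need a mesh- or
    network-based signed distance function instead.
    """
    sdf = sphere_sdf(hand_points, center, radius)
    # Interpenetration: hand points that lie inside the object (sdf < 0).
    max_penetration = float(np.maximum(-sdf, 0.0).max())
    # Contact: at least one hand point within contact_tol of the surface.
    in_contact = bool((np.abs(sdf) <= contact_tol).any())
    return max_penetration, in_contact

center = np.zeros(3)
radius = 0.04  # metres, an arbitrary illustrative value

hovering = np.array([[0.05, 0.0, 0.0]])   # 1 cm above the surface
piercing = np.array([[0.03, 0.0, 0.0]])   # 1 cm inside the object
touching = np.array([[0.04, 0.0, 0.0]])   # exactly on the surface

print(physical_violations(hovering, center, radius))
print(physical_violations(piercing, center, radius))
print(physical_violations(touching, center, radius))
```

A pose with a positive penetration depth, or with a grasped object but no contact, would be physically implausible under this toy check even if its 2D projection matched the image well.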

Taxonomy

Topics: Human Pose and Action Recognition · Robot Manipulation and Learning · Hand Gesture Recognition Systems