PCIE_Pose Solution for EgoExo4D Pose and Proficiency Estimation Challenge
Feng Chen, Kanokphan Lertniphonphan, Qiancheng Yan, Xiaohui Fan, Jun Xie, Tao Zhang, Zhepeng Wang

TL;DR
This report presents transformer-based solutions for egocentric hand and body pose estimation that won the EgoExo4D Hand Pose and Body Pose Challenges at CVPR 2025.
Contribution
Introduces the HP-ViT+ architecture, which fuses a Vision Transformer and a CNN backbone for hand pose estimation, and a multimodal spatio-temporal feature integration strategy for body pose estimation.
Findings
Achieved 8.31 PA-MPJPE in Hand Pose Challenge
Achieved 11.25 MPJPE in Body Pose Challenge
Achieved a Top-1 accuracy of 0.53 in Proficiency Estimation
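The two pose metrics quoted above can be illustrated with a short sketch. MPJPE is the mean Euclidean distance between predicted and ground-truth joints; PA-MPJPE computes the same error after a Procrustes (similarity) alignment of the prediction to the ground truth. The function names and the NumPy implementation below are assumptions made for illustration and are not the challenge's official evaluation code.

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean per-joint position error for (J, 3) joint arrays."""
    return float(np.linalg.norm(pred - gt, axis=-1).mean())

def procrustes_align(pred, gt):
    """Similarity-align pred to gt (scale, rotation, translation)."""
    mu_p, mu_g = pred.mean(axis=0), gt.mean(axis=0)
    p, g = pred - mu_p, gt - mu_g
    # Cross-covariance between centered prediction and target.
    H = p.T @ g
    U, S, Vt = np.linalg.svd(H)
    # Flip the last axis if needed so the result is a proper rotation.
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    D = np.diag([1.0, 1.0, d])
    R = Vt.T @ D @ U.T
    scale = (S * np.diag(D)).sum() / (p ** 2).sum()
    return scale * p @ R.T + mu_g

def pa_mpjpe(pred, gt):
    """MPJPE after Procrustes alignment of pred to gt."""
    return mpjpe(procrustes_align(pred, gt), gt)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    gt = rng.normal(size=(21, 3))           # 21 hand joints, as in the task
    pred = gt + 0.01 * rng.normal(size=(21, 3))
    print("MPJPE:", mpjpe(pred, gt))
    print("PA-MPJPE:", pa_mpjpe(pred, gt))
```

Because the alignment removes global scale, rotation, and translation, PA-MPJPE measures the error in the articulated pose only, which is why it is never larger than the best achievable MPJPE under a similarity transform.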
Abstract
This report introduces our team's (PCIE_EgoPose) solutions for the EgoExo4D Pose and Proficiency Estimation Challenges at CVPR2025. Focused on the intricate task of estimating 21 3D hand joints from RGB egocentric videos, which are complicated by subtle movements and frequent occlusions, we developed the Hand Pose Vision Transformer (HP-ViT+). This architecture synergizes a Vision Transformer and a CNN backbone, using weighted fusion to refine the hand pose predictions. For the EgoExo4D Body Pose Challenge, we adopted a multimodal spatio-temporal feature integration strategy to address the complexities of body pose estimation across dynamic contexts. Our methods achieved remarkable performance: 8.31 PA-MPJPE in the Hand Pose Challenge and 11.25 MPJPE in the Body Pose Challenge, securing championship titles in both competitions. We extended our pose estimation solutions to the…
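The abstract states that HP-ViT+ uses weighted fusion to combine the outputs of a Vision Transformer and a CNN backbone, but the excerpt does not give the fusion rule. A minimal sketch, assuming a simple convex combination of two 21-joint predictions, could look like the following; `fuse_poses`, `vit_weight`, and the default value of 0.5 are hypothetical and not taken from the report.

```python
import numpy as np

def fuse_poses(vit_joints, cnn_joints, vit_weight=0.5):
    """Blend two (J, 3) joint predictions with a convex weight.

    vit_weight is a hypothetical hyperparameter in [0, 1]; the report's
    actual weighting scheme is not specified in this excerpt.
    """
    vit_joints = np.asarray(vit_joints, dtype=float)
    cnn_joints = np.asarray(cnn_joints, dtype=float)
    if vit_joints.shape != cnn_joints.shape:
        raise ValueError("predictions must have the same shape")
    if not 0.0 <= vit_weight <= 1.0:
        raise ValueError("vit_weight must lie in [0, 1]")
    return vit_weight * vit_joints + (1.0 - vit_weight) * cnn_joints

if __name__ == "__main__":
    vit_pred = np.zeros((21, 3))   # placeholder prediction from one branch
    cnn_pred = np.ones((21, 3))    # placeholder prediction from the other
    fused = fuse_poses(vit_pred, cnn_pred, vit_weight=0.25)
    print(fused[0])
```

In practice such a weight would typically be chosen on a validation set, and it could also be set per joint rather than globally.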
Taxonomy
Topics: Robotic Mechanisms and Dynamics · Robot Manipulation and Learning · Hand Gesture Recognition Systems
Methods: Attention Is All You Need · Linear Layer · Adam · Dense Connections · Vision Transformer · Softmax · Position-Wise Feed-Forward Layer · Absolute Position Encodings · Label Smoothing · Multi-Head Attention
