EgoPoseFormer v2: Accurate Egocentric Human Motion Estimation for AR/VR
Zhenyu Li, Sai Kumar Dwivedi, Filip Maric, Carlos Chacon, Nadine Bertsch, Filippo Arcadu, Tomas Hodan, Michael Ramamonjisoa, Peter Wonka, Amy Zhao, Robin Kips, Cem Keskin, Anastasia Tkach, Chenhongyi Yang

TL;DR
EgoPoseFormer v2 pairs a transformer-based egocentric human motion estimator with an auto-labeling system that trains on large unlabeled datasets, improving accuracy and temporal consistency for AR/VR applications.
Contribution
The paper presents a transformer-based model and a companion auto-labeling system for scalable, accurate egocentric human motion estimation in AR/VR.
Findings
Outperforms state-of-the-art by 12.2% and 19.4% in accuracy
Reduces temporal jitter by over 50%
Auto-labeling improves wrist MPJPE by 13.1%
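For readers unfamiliar with the metric named in the last finding: MPJPE (mean per-joint position error) is the Euclidean distance between predicted and ground-truth 3D joint positions, averaged over joints and frames. The sketch below is a minimal NumPy version; the function name and array shapes are illustrative assumptions and are not taken from the paper.

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean per-joint position error.

    pred, gt: arrays of shape (..., J, 3) holding 3D joint positions
    (hypothetical layout: any leading batch/frame dims, J joints, xyz).
    Returns the mean Euclidean distance in the units of the inputs.
    """
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    # Per-joint Euclidean distance, then average over every joint and frame.
    per_joint_error = np.linalg.norm(pred - gt, axis=-1)
    return float(per_joint_error.mean())
```

A per-joint figure such as wrist MPJPE would come from slicing the wrist joints out of both arrays before calling the function; which indices correspond to the wrists depends on the skeleton definition, which this page does not specify.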
Abstract
Egocentric human motion estimation is essential for AR/VR experiences, yet remains challenging due to limited body coverage from the egocentric viewpoint, frequent occlusions, and scarce labeled data. We present EgoPoseFormer v2, a method that addresses these challenges through two key contributions: (1) a transformer-based model for temporally consistent and spatially grounded body pose estimation, and (2) an auto-labeling system that enables the use of large unlabeled datasets for training. Our model is fully differentiable, introduces identity-conditioned queries, multi-view spatial refinement, causal temporal attention, and supports both keypoints and parametric body representations under a constant compute budget. The auto-labeling system scales learning to tens of millions of unlabeled frames via uncertainty-aware semi-supervised training. The system follows a teacher-student…
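The abstract mentions causal temporal attention without giving details here. As a generic illustration of the idea, and not the paper's architecture, the sketch below applies single-head scaled dot-product attention over a sequence of per-frame features with a causal mask, so that frame t only attends to frames up to and including t. The function name, shapes, and single-head setup are assumptions made for this example.

```python
import numpy as np

def causal_attention(q, k, v):
    """Single-head scaled dot-product attention with a causal mask.

    q, k, v: arrays of shape (T, D), one feature vector per frame
    (hypothetical layout). Returns (output, weights) with shapes
    (T, D) and (T, T).
    """
    T, D = q.shape
    scores = q @ k.T / np.sqrt(D)
    # True strictly above the diagonal marks future frames to be hidden.
    future = np.triu(np.ones((T, T), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    # Numerically stable softmax; the diagonal is never masked,
    # so every row has at least one finite entry.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights
```

Because no frame depends on later frames, a model built this way can run on a live stream, which is the usual motivation for causal attention in real-time settings such as AR/VR.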
Taxonomy
Topics: Human Pose and Action Recognition · Human Motion and Animation · 3D Shape Modeling and Analysis
