AG-EgoPose: Leveraging Action-Guided Motion and Kinematic Joint Encoding for Egocentric 3D Pose Estimation

Md Mushfiqur Azam; John Quarles; Kevin Desai

arXiv:2603.25175·cs.CV·March 27, 2026

Open Access

TL;DR

AG-EgoPose is a dual-stream framework for egocentric 3D pose estimation from fisheye camera input. It combines short- and long-range motion context with fine-grained spatial cues through transformer-based fusion, and the authors report state-of-the-art results.

Contribution

The paper presents a novel dual-stream architecture that integrates spatial and temporal information with transformer-based joint-level fusion for egocentric 3D pose estimation.

Findings

01 Achieves state-of-the-art performance on real-world datasets.

02 Effectively leverages motion context in egocentric videos.

03 Outperforms existing methods in both quantitative and qualitative evaluations.

Abstract

Egocentric 3D human pose estimation remains challenging due to severe perspective distortion, limited body visibility, and complex camera motion inherent in first-person viewpoints. Existing methods typically rely on single-frame analysis or limited temporal fusion, which fails to effectively leverage the rich motion context available in egocentric videos. We introduce AG-EgoPose, a novel dual-stream framework that integrates short- and long-range motion context with fine-grained spatial cues for robust pose estimation from fisheye camera input. Our framework features two parallel streams: A spatial stream uses a weight-sharing ResNet-18 encoder-decoder to generate 2D joint heatmaps and corresponding joint-specific spatial feature tokens. Simultaneously, a temporal stream uses a ResNet-50 backbone to extract visual features, which are then processed by an action recognition backbone to…
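
The abstract says the spatial stream produces 2D joint heatmaps together with joint-specific spatial feature tokens, but this excerpt does not say how the tokens are obtained. The sketch below shows one common way such tokens can be formed, heatmap-weighted pooling of a feature map. It is an illustration under stated assumptions, not the authors' implementation: the function names `spatial_softmax` and `joint_tokens` and the nested-list data layout are hypothetical.

```python
import math


def spatial_softmax(heatmap):
    """Normalize a 2D heatmap (list of rows) into weights that sum to 1."""
    peak = max(v for row in heatmap for v in row)  # subtract max for stability
    exps = [[math.exp(v - peak) for v in row] for row in heatmap]
    total = sum(sum(row) for row in exps)
    return [[e / total for e in row] for row in exps]


def joint_tokens(heatmaps, features):
    """Pool one feature token per joint (assumed scheme, not from the paper).

    heatmaps: J heatmaps, each an H x W nested list (hypothetical layout).
    features: C feature channels, each an H x W nested list (hypothetical layout).
    Returns J tokens, each a list of C floats.
    """
    tokens = []
    for heatmap in heatmaps:
        weights = spatial_softmax(heatmap)
        # Each token entry is the heatmap-weighted average of one channel.
        token = [
            sum(w * f
                for w_row, f_row in zip(weights, channel)
                for w, f in zip(w_row, f_row))
            for channel in features
        ]
        tokens.append(token)
    return tokens


# Toy example: 2 joints, 2 channels, 2 x 2 spatial grid.
features = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[10.0, 20.0], [30.0, 40.0]],
]
heatmaps = [
    [[0.0, 50.0], [0.0, 0.0]],  # sharply peaked at row 0, column 1
    [[0.0, 0.0], [0.0, 0.0]],   # flat, so pooling reduces to a plain mean
]
tokens = joint_tokens(heatmaps, features)
```

With a sharply peaked heatmap the token is close to the feature vector at the peak location, while a flat heatmap yields the spatial mean of each channel. Tokens of this kind could then be passed to a joint-level fusion module, which the abstract describes only at a high level.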

Taxonomy

Topics: Human Pose and Action Recognition · Human Motion and Animation · Robot Manipulation and Learning