Estimating Ego-Body Pose from Doubly Sparse Egocentric Video Data
Seunggeun Chi, Pin-Hao Huang, Enna Sachdeva, Hengbo Ma, Karthik Ramani, and Kwonjoon Lee

TL;DR
This paper presents a novel two-stage method for estimating full-body pose from sparse egocentric video data, combining temporal imputation and spatial generation to improve accuracy and robustness.
Contribution
It introduces a two-stage approach using masked autoencoders and diffusion models to estimate full-body pose from sparse head and hand observations in egocentric videos.
Findings
Temporally sparse hand observations, combined with head pose, can effectively constrain full-body motion
Outperforms naively applying a diffusion model to head pose and sparse hand pose
Validated on multiple datasets with strong results
Abstract
We study the problem of estimating the body movements of a camera wearer from egocentric videos. Current methods for ego-body pose estimation rely on temporally dense sensor data, such as IMU measurements from spatially sparse body parts like the head and hands. However, we propose that even temporally sparse observations, such as hand poses captured intermittently from egocentric videos during natural or periodic hand movements, can effectively constrain overall body motion. Naively applying diffusion models to generate full-body pose from head pose and sparse hand pose leads to suboptimal results. To overcome this, we develop a two-stage approach that decomposes the problem into temporal completion and spatial completion. First, our method employs masked autoencoders to impute hand trajectories by leveraging the spatiotemporal correlations between the head pose sequence and…
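The temporal-completion step described in the abstract can be illustrated with a small sketch. Everything here is an assumption made for illustration: the function names `impute_hand_track` and `build_conditioning`, the 3-D position features, and above all the use of linear interpolation, which is only a stand-in for the learned masked-autoencoder imputation and is not the authors' model. The second stage (spatial completion into a full-body pose) is omitted.

```python
from typing import List, Optional, Sequence

Vec = List[float]


def impute_hand_track(observed: Sequence[Optional[Vec]]) -> List[Vec]:
    """Fill frames where the hand is not visible (None).

    Linear interpolation between the nearest visible frames is used as a
    simple placeholder for learned imputation (hypothetical, not the paper's method).
    """
    visible = [t for t, v in enumerate(observed) if v is not None]
    if not visible:
        raise ValueError("need at least one visible hand frame")
    filled: List[Vec] = []
    for t, value in enumerate(observed):
        if value is not None:
            filled.append(list(value))
            continue
        prev = max((i for i in visible if i < t), default=None)
        nxt = min((i for i in visible if i > t), default=None)
        if prev is None:
            # No earlier observation: copy the next visible frame.
            filled.append(list(observed[nxt]))
        elif nxt is None:
            # No later observation: copy the previous visible frame.
            filled.append(list(observed[prev]))
        else:
            w = (t - prev) / (nxt - prev)
            a, b = observed[prev], observed[nxt]
            filled.append([(1 - w) * x + w * y for x, y in zip(a, b)])
    return filled


def build_conditioning(head: Sequence[Vec],
                       hand: Sequence[Optional[Vec]]) -> List[Vec]:
    """Concatenate dense head features with the imputed hand track, per frame."""
    dense_hand = impute_hand_track(hand)
    return [list(h) + d for h, d in zip(head, dense_hand)]


# Head pose is temporally dense; the hand is seen only in the first and last frame.
head = [[0.0, 1.6, 0.0], [0.1, 1.6, 0.0], [0.2, 1.6, 0.0],
        [0.3, 1.6, 0.0], [0.4, 1.6, 0.0]]
hand = [[0.0, 1.0, 0.3], None, None, None, [0.4, 1.2, 0.3]]
cond = build_conditioning(head, hand)
```

In this toy setup, `cond` holds one six-dimensional vector per frame (head plus imputed hand), which is the kind of temporally dense signal a second-stage generator could be conditioned on.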
Taxonomy
Topics: Human Pose and Action Recognition · Human Motion and Animation · Virtual Reality Applications and Impacts
Methods: Diffusion
