Building Spatio-temporal Transformers for Egocentric 3D Pose Estimation
Jinman Park, Kimathi Kaai, Saad Hossain, Norikatsu Sumi, Sirisha Rambhatla, Paul Fieguth

TL;DR
This paper introduces Ego-STAN, a spatio-temporal Transformer model that leverages past frames and feature map tokens to improve egocentric 3D human pose estimation, significantly reducing error and model size.
Contribution
It presents a novel spatio-temporal Transformer architecture with feature map tokens for egocentric 3D pose estimation, addressing self-occlusion and distortion challenges.
Findings
30.6% improvement in mean per-joint position error
22% reduction in model parameters
Superior performance on xR-EgoPose dataset
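The headline metric above, mean per-joint position error (MPJPE), is the average Euclidean distance between predicted and ground-truth 3D joint positions. The sketch below shows one way to compute it and how a relative improvement percentage is typically derived from two error values. The function names `mpjpe` and `relative_improvement` and the sample coordinates are illustrative assumptions, not taken from the paper or its code.

```python
import math


def mpjpe(pred, target):
    """Mean per-joint position error.

    pred and target are equal-length sequences of (x, y, z) joint
    positions; the result is the mean Euclidean distance per joint,
    in the same unit as the inputs.
    """
    if len(pred) != len(target) or len(pred) == 0:
        raise ValueError("pred and target must be non-empty and the same length")
    return sum(math.dist(p, t) for p, t in zip(pred, target)) / len(pred)


def relative_improvement(baseline_error, new_error):
    """Percentage reduction in error relative to a baseline (assumed definition)."""
    return 100.0 * (baseline_error - new_error) / baseline_error


# Hypothetical two-joint pose: the first joint is off by 5 units, the second is exact.
predicted = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
ground_truth = [(3.0, 4.0, 0.0), (1.0, 0.0, 0.0)]
error = mpjpe(predicted, ground_truth)
```

Under this definition, a 30.6% improvement would mean the new model's MPJPE is 30.6% lower than the baseline's on the same evaluation set.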
Abstract
Egocentric 3D human pose estimation (HPE) from images is challenging due to severe self-occlusions and strong distortion introduced by the fish-eye view from the head-mounted camera. Although existing works use intermediate heatmap-based representations to counter distortion with some success, addressing self-occlusion remains an open problem. In this work, we leverage information from past frames to guide our self-attention-based 3D HPE procedure -- Ego-STAN. Specifically, we build a spatio-temporal Transformer model that attends to semantically rich convolutional neural network-based feature maps. We also propose feature map tokens: a new set of learnable parameters to attend to these feature maps. Finally, we demonstrate Ego-STAN's superior performance on the xR-EgoPose dataset where it achieves a 30.6% improvement on the overall mean per-joint position error, while…
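To make the idea of learnable tokens attending to feature maps more concrete, here is a minimal sketch of scaled dot-product attention in which a small set of token vectors act as queries over flattened feature-map positions. It assumes a single head, no learned projection matrices, and fixed token values standing in for trained parameters. The name `token_attention` is hypothetical, and the paper's actual architecture may differ in all of these respects.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def token_attention(tokens, features):
    """Let each token attend to every feature-map position.

    tokens:   K query vectors of dimension d (learnable in a real model;
              plain lists here, as an assumption for illustration).
    features: N feature vectors of dimension d, e.g. CNN feature-map
              positions flattened across space and past frames.
    Returns K vectors, each a softmax-weighted average of the features.
    """
    dim = len(features[0])
    scale = math.sqrt(dim)
    outputs = []
    for query in tokens:
        # Scaled dot-product similarity between the token and each position.
        scores = [sum(q * f for q, f in zip(query, feat)) / scale for feat in features]
        weights = softmax(scores)
        # Weighted sum of feature vectors, one output component at a time.
        outputs.append(
            [sum(w * feat[i] for w, feat in zip(weights, features)) for i in range(dim)]
        )
    return outputs


# A zero-valued token scores all positions equally, so it returns their mean.
pooled = token_attention([[0.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]])
```

In a trained model the token values would be updated by gradient descent, so each token could learn to weight the positions most useful for the pose estimate.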
Taxonomy
Topics: Human Pose and Action Recognition · Hand Gesture Recognition Systems · Diabetic Foot Ulcer Assessment and Management
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Layer Normalization · Label Smoothing · Softmax · Absolute Position Encodings · Dropout · Adam · Byte Pair Encoding
