Building Spatio-temporal Transformers for Egocentric 3D Pose Estimation

Jinman Park; Kimathi Kaai; Saad Hossain; Norikatsu Sumi; Sirisha Rambhatla; Paul Fieguth

arXiv:2206.04785 · cs.CV · June 13, 2022

TL;DR

This paper introduces Ego-STAN, a spatio-temporal Transformer that uses past frames and learnable feature map tokens for egocentric 3D human pose estimation. On the xR-EgoPose dataset it improves overall mean per-joint position error by 30.6% while using 22% fewer parameters.

Contribution

The paper presents a spatio-temporal Transformer architecture with feature map tokens for egocentric 3D pose estimation, targeting the self-occlusion and fish-eye distortion that a head-mounted camera introduces.

Findings

01. 30.6% improvement in mean per-joint position error

02. 22% reduction in model parameters

03. Superior performance on xR-EgoPose dataset
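
The headline metric, mean per-joint position error (MPJPE), is the average Euclidean distance between predicted and ground-truth 3D joints. The sketch below shows how the metric and a relative improvement figure can be computed. The function names and the two-joint pose are illustrative assumptions and do not come from the paper.

```python
import math


def mpjpe(pred, target):
    """Average Euclidean distance between matching 3D joints.

    The result is in the same units as the inputs (e.g. millimetres).
    """
    if len(pred) != len(target) or not pred:
        raise ValueError("pred and target must be non-empty and equal length")
    return sum(math.dist(p, t) for p, t in zip(pred, target)) / len(pred)


def relative_improvement(baseline_error, new_error):
    """Fraction by which new_error is lower than baseline_error."""
    return (baseline_error - new_error) / baseline_error


# Hypothetical two-joint pose (not data from the paper).
pred = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0)]
target = [(0.0, 0.0, 0.0), (0.0, 0.0, 0.0)]
error = mpjpe(pred, target)  # joint errors are 0 and 5, so the mean is 2.5
```

Under this definition, a 30.6% improvement means the new model's MPJPE is 30.6% lower than the baseline's, for example a drop from 100.0 to 69.4 in the same units.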

Abstract

Egocentric 3D human pose estimation (HPE) from images is challenging due to severe self-occlusions and strong distortion introduced by the fish-eye view from the head mounted camera. Although existing works use intermediate heatmap-based representations to counter distortion with some success, addressing self-occlusion remains an open problem. In this work, we leverage information from past frames to guide our self-attention-based 3D HPE estimation procedure -- Ego-STAN. Specifically, we build a spatio-temporal Transformer model that attends to semantically rich convolutional neural network-based feature maps. We also propose feature map tokens: a new set of learnable parameters to attend to these feature maps. Finally, we demonstrate Ego-STAN's superior performance on the xR-EgoPose dataset where it achieves a 30.6% improvement on the overall mean per-joint position error, while…
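
The abstract does not spell out how feature map tokens interact with the feature maps, so the following is only a minimal sketch of one plausible reading: a learnable vector acts as an attention query over CNN feature vectors gathered from the current and past frames. It uses single-head scaled dot-product attention with no learned projections. The sizes `T`, `HW`, and `C`, the random values standing in for CNN features, and all function names are assumptions made for illustration.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Single-head scaled dot-product attention for one query vector."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    output = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return output, weights


def flatten_frames(frames):
    """Concatenate per-frame feature vectors into one spatio-temporal sequence."""
    return [vec for frame in frames for vec in frame]


random.seed(0)
T, HW, C = 3, 4, 8  # hypothetical: 3 frames, 4 spatial positions, 8 channels
frames = [[[random.gauss(0, 1) for _ in range(C)] for _ in range(HW)]
          for _ in range(T)]
# Stand-in for a learnable token; training would update it by gradient descent.
token = [random.gauss(0, 1) for _ in range(C)]

sequence = flatten_frames(frames)                    # T * HW feature vectors
summary, weights = attend(token, sequence, sequence)
```

Here `summary` is a weighted average of features across all frames and positions, and `weights` shows how much each position contributes. A full model would presumably add learned projections, multiple heads, and position encodings before regressing joint coordinates.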

Taxonomy

Topics: Human Pose and Action Recognition · Hand Gesture Recognition Systems · Diabetic Foot Ulcer Assessment and Management

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Layer Normalization · Label Smoothing · Softmax · Absolute Position Encodings · Dropout · Adam · Byte Pair Encoding