Uplift and Upsample: Efficient 3D Human Pose Estimation with Uplifting Transformers

Moritz Einfalt; Katja Ludwig; Rainer Lienhart

arXiv:2210.06110·cs.CV·October 24, 2022

TL;DR

This paper introduces a Transformer-based method for 3D human pose estimation that efficiently handles sparse 2D inputs, enabling real-time dense 3D pose predictions with reduced computational cost.

Contribution

The paper proposes a Transformer-based pose uplifting approach that uses masked token modeling for temporal upsampling, which reduces computational complexity and enables real-time inference on consumer hardware.

Findings

01 Achieves competitive MPJPE (mean per-joint position error) on two popular benchmark datasets.

02 Reduces inference time by a factor of 12.

03 Enables real-time 3D pose estimation on consumer hardware.
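
The MPJPE metric named in the first finding is the mean Euclidean distance between predicted and ground-truth 3D joint positions. The following is a minimal sketch of that definition only. The function name `mpjpe`, the array layout, and the 17-joint toy data are assumptions for illustration, and benchmark-specific steps such as root alignment are deliberately left out.

```python
import numpy as np

def mpjpe(pred, target):
    """Mean per-joint position error.

    pred, target: arrays of shape (frames, joints, 3) in the same unit.
    Returns the Euclidean distance per joint, averaged over joints and frames.
    """
    return float(np.linalg.norm(pred - target, axis=-1).mean())

# Toy data (hypothetical): every joint is off by the vector (3, 4, 0).
target = np.zeros((2, 17, 3))
pred = target.copy()
pred[..., 0] += 3.0
pred[..., 1] += 4.0

error = mpjpe(pred, target)
print(error)  # 5.0
```

Because every joint in the toy data is displaced by the same vector of length 5, the average error is exactly 5 in whatever unit the coordinates use.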

Abstract

The state of the art for monocular 3D human pose estimation in videos is dominated by the paradigm of 2D-to-3D pose uplifting. While the uplifting methods themselves are rather efficient, the true computational complexity depends on the per-frame 2D pose estimation. In this paper, we present a Transformer-based pose uplifting scheme that can operate on temporally sparse 2D pose sequences but still produce temporally dense 3D pose estimates. We show how masked token modeling can be utilized for temporal upsampling within Transformer blocks. This allows us to decouple the sampling rate of input 2D poses from the target frame rate of the video and drastically decreases the total computational complexity. Additionally, we explore the option of pre-training on large motion capture archives, which has been largely neglected so far. We evaluate our method on two popular benchmark datasets:…
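
The abstract describes feeding temporally sparse 2D poses to the Transformer and using masked token modeling to recover a temporally dense output. The sketch below illustrates only the input side of that idea: sampled pose embeddings are placed on the full time axis and every unsampled frame gets a placeholder token. The function name `build_token_sequence`, the fixed `stride`, the zero-valued `mask_token`, and the choice to insert placeholders at the input are all assumptions made for illustration. The abstract states only that upsampling happens within Transformer blocks, so the paper's actual design may differ.

```python
import numpy as np

def build_token_sequence(pose_tokens, stride, num_frames, mask_token):
    """Place sparse per-frame tokens on a dense time axis.

    pose_tokens: (K, D) embeddings of 2D poses sampled every `stride` frames.
    mask_token:  (D,) placeholder embedding (learned in a real model, fixed here).
    Returns the (num_frames, D) sequence and a boolean array marking masked slots.
    """
    sampled = np.arange(0, num_frames, stride)
    if len(sampled) != len(pose_tokens):
        raise ValueError("expected one token per sampled frame")
    # Start with the placeholder everywhere, then overwrite the sampled frames.
    tokens = np.tile(mask_token, (num_frames, 1))
    tokens[sampled] = pose_tokens
    is_masked = np.ones(num_frames, dtype=bool)
    is_masked[sampled] = False
    return tokens, is_masked

rng = np.random.default_rng(0)
num_frames, stride, dim = 9, 4, 8
pose_tokens = rng.normal(size=(3, dim))  # stand-ins for frames 0, 4 and 8
mask_token = np.zeros(dim)               # hypothetical placeholder value

tokens, is_masked = build_token_sequence(pose_tokens, stride, num_frames, mask_token)
print(tokens.shape)            # (9, 8)
print(is_masked.astype(int))   # [0 1 1 1 0 1 1 1 0]
```

With a stride of 4, a 2D pose estimator would need to run on only 3 of the 9 frames, while a model consuming `tokens` could still emit one 3D pose per frame. This is the sense in which the input sampling rate is decoupled from the output frame rate.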

Taxonomy

Topics: Human Pose and Action Recognition · Advanced Vision and Imaging · Video Surveillance and Tracking Methods

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Byte Pair Encoding · Absolute Position Encodings · Layer Normalization · Position-Wise Feed-Forward Layer · Residual Connection · Dropout · Adam