Uplift and Upsample: Efficient 3D Human Pose Estimation with Uplifting Transformers
Moritz Einfalt, Katja Ludwig, Rainer Lienhart

TL;DR
This paper introduces a Transformer-based method for 3D human pose estimation in video that takes temporally sparse 2D pose sequences as input and still produces temporally dense 3D pose predictions, enabling real-time inference at reduced computational cost.
Contribution
The paper proposes a Transformer-based uplifting approach that uses masked token modeling for temporal upsampling, reducing computational complexity and enabling real-time inference on consumer hardware.
Findings
Achieves MPJPE scores competitive with the state of the art on two benchmark datasets.
Reduces inference time by a factor of 12.
Enables real-time 3D pose estimation on standard hardware.
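For readers unfamiliar with the metric named in the findings, MPJPE (mean per-joint position error) is the Euclidean distance between predicted and ground-truth 3D joint positions, averaged over all joints and frames. The snippet below is a minimal sketch of that definition only. The function name, array shapes, and toy values are illustrative assumptions, and benchmark protocols typically add steps such as root-relative alignment that are omitted here.

```python
import numpy as np

def mpjpe(pred, target):
    """Mean per-joint position error.

    pred, target: arrays of shape (frames, joints, 3), in the same unit (e.g. mm).
    Returns the Euclidean joint distance averaged over all frames and joints.
    """
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    per_joint_error = np.linalg.norm(pred - target, axis=-1)  # (frames, joints)
    return float(per_joint_error.mean())

# Toy data (hypothetical): 2 frames, 2 joints, ground truth at the origin.
target = np.zeros((2, 2, 3))
pred = target.copy()
pred[0, 0] = [3.0, 4.0, 0.0]  # one joint is off by 5 units, the rest are exact

print(mpjpe(pred, target))  # 5 / 4 joints-frames = 1.25
```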
Abstract
The state-of-the-art for monocular 3D human pose estimation in videos is dominated by the paradigm of 2D-to-3D pose uplifting. While the uplifting methods themselves are rather efficient, the true computational complexity depends on the per-frame 2D pose estimation. In this paper, we present a Transformer-based pose uplifting scheme that can operate on temporally sparse 2D pose sequences but still produce temporally dense 3D pose estimates. We show how masked token modeling can be utilized for temporal upsampling within Transformer blocks. This allows us to decouple the sampling rate of input 2D poses from the target frame rate of the video and drastically decreases the total computational complexity. Additionally, we explore the option of pre-training on large motion capture archives, which has been largely neglected so far. We evaluate our method on two popular benchmark datasets:…
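The idea of feeding temporally sparse 2D poses while still predicting a pose for every frame can be illustrated on the input side alone. The sketch below is not the authors' implementation. It only shows one plausible way to lay out a token sequence in which frames with a 2D pose carry their own token and all other frames carry a shared mask token. The function name, the token dimension, the stride, and the all-zero mask token (which would be a learned parameter in a real model) are assumptions, and the Transformer that maps these tokens to 3D poses is left out.

```python
import numpy as np

def build_token_sequence(pose_tokens, num_frames, stride, mask_token):
    """Lay out a dense token sequence from sparsely sampled pose tokens.

    pose_tokens: (num_sampled, dim) tokens for frames 0, stride, 2*stride, ...
    mask_token:  (dim,) placeholder used for every frame without a 2D pose.
    Returns the (num_frames, dim) sequence and a boolean mask of filled-in frames.
    """
    dim = mask_token.shape[0]
    keep = np.arange(0, num_frames, stride)        # frames that have a 2D pose
    assert pose_tokens.shape == (len(keep), dim)

    tokens = np.tile(mask_token, (num_frames, 1))  # start with mask tokens everywhere
    tokens[keep] = pose_tokens                     # overwrite the sampled positions

    is_masked = np.ones(num_frames, dtype=bool)
    is_masked[keep] = False
    return tokens, is_masked

# Hypothetical sizes: 9 video frames, a 2D pose on every 4th frame, 8-dim tokens.
num_frames, stride, dim = 9, 4, 8
rng = np.random.default_rng(0)
mask_token = np.zeros(dim)                 # stands in for a learned parameter
sparse = rng.normal(size=(3, dim))         # tokens for frames 0, 4, 8

tokens, is_masked = build_token_sequence(sparse, num_frames, stride, mask_token)
print(tokens.shape)          # (9, 8)
print(int(is_masked.sum()))  # 6 frames must be reconstructed by the model
```

In this toy setup the 2D pose estimator would run on 3 of 9 frames, which is where the savings described in the abstract come from: the per-frame 2D estimation cost scales with the input sampling rate rather than with the video frame rate.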
Taxonomy
Topics: Human Pose and Action Recognition · Advanced Vision and Imaging · Video Surveillance and Tracking Methods
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Byte Pair Encoding · Absolute Position Encodings · Layer Normalization · Position-Wise Feed-Forward Layer · Residual Connection · Dropout · Adam
