Hourglass Tokenizer for Efficient Transformer-Based 3D Human Pose Estimation
Wenhao Li, Mengyuan Liu, Hong Liu, Pichao Wang, Jialun Cai, Nicu Sebe

TL;DR
The paper introduces the Hourglass Tokenizer (HoT), a framework that prunes and recovers pose tokens in transformer models to significantly improve efficiency in 3D human pose estimation from videos while maintaining high accuracy.
Contribution
HoT is a plug-and-play framework that dynamically prunes pose tokens and later recovers them, reducing the computational cost of transformer-based video pose estimation models.
Findings
HoT saves nearly 50% FLOPs on MotionBERT without accuracy loss.
HoT reduces FLOPs by about 40% on MixSTE with only a 0.2% accuracy drop.
Experiments on two benchmarks, Human3.6M and MPI-INF-3DHP, show that HoT improves efficiency while keeping accuracy high.
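To see why keeping fewer tokens in the intermediate blocks lowers FLOPs, a rough cost model can help. The sketch below is an illustration under stated assumptions, not the paper's accounting: it uses a generic per-block estimate for a standard transformer block, ignores the overhead of the pruning and recovering modules, and all names (`block_macs`, `relative_cost`) and input values are hypothetical. It does not reproduce the percentages reported above.

```python
def block_macs(n_tokens, dim, mlp_ratio=4):
    """Approximate multiply-accumulates for one standard transformer block."""
    proj = 4 * n_tokens * dim * dim             # Q, K, V and output projections
    attn = 2 * n_tokens * n_tokens * dim        # QK^T scores and weighted sum of V
    mlp = 2 * mlp_ratio * n_tokens * dim * dim  # two linear layers of the MLP
    return proj + attn + mlp


def relative_cost(n_full, n_kept, n_blocks, prune_after, dim):
    """Cost of a pruned model relative to the unpruned one (1.0 = no saving).

    The first `prune_after` blocks see all `n_full` tokens; the remaining
    blocks see only `n_kept` tokens. Pruning/recovery overhead is ignored.
    """
    full = n_blocks * block_macs(n_full, dim)
    pruned = (prune_after * block_macs(n_full, dim)
              + (n_blocks - prune_after) * block_macs(n_kept, dim))
    return pruned / full


# Hypothetical configuration, chosen only for illustration.
ratio = relative_cost(n_full=16, n_kept=4, n_blocks=8, prune_after=2, dim=32)
print(f"relative cost: {ratio:.2f}")
```

In this model the saving grows when tokens are pruned earlier or more aggressively, which is the trade-off against accuracy that the findings above quantify.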
Abstract
Transformers have been successfully applied in the field of video-based 3D human pose estimation. However, the high computational costs of these video pose transformers (VPTs) make them impractical on resource-constrained devices. In this paper, we present a plug-and-play pruning-and-recovering framework, called Hourglass Tokenizer (HoT), for efficient transformer-based 3D human pose estimation from videos. Our HoT begins with pruning pose tokens of redundant frames and ends with recovering full-length tokens, resulting in a few pose tokens in the intermediate transformer blocks and thus improving the model efficiency. To effectively achieve this, we propose a token pruning cluster (TPC) that dynamically selects a few representative tokens with high semantic diversity while eliminating the redundancy of video frames. In addition, we develop a token recovering attention (TRA) to restore…
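The pruning-and-recovering idea described in the abstract can be sketched in a few lines. This is a minimal illustration under loud assumptions, not the authors' implementation: greedy farthest-point selection stands in for the token pruning cluster, and a single-head cross-attention with randomly initialised queries stands in for the recovery step, whose details the truncated abstract does not give. The function names and the sizes `T`, `D`, and `f` are hypothetical.

```python
import numpy as np


def select_representative_frames(tokens, num_keep):
    """Pick `num_keep` diverse frame tokens by greedy farthest-point selection.

    Stand-in (assumption) for a diversity-seeking selector; tokens is (T, D).
    """
    assert 1 <= num_keep <= tokens.shape[0]
    # Start from the token closest to the mean of all tokens.
    start = int(np.argmin(np.linalg.norm(tokens - tokens.mean(axis=0), axis=1)))
    chosen = [start]
    min_dist = np.linalg.norm(tokens - tokens[start], axis=1)
    min_dist[start] = -np.inf  # never pick the same frame twice
    while len(chosen) < num_keep:
        nxt = int(np.argmax(min_dist))  # farthest from everything chosen so far
        chosen.append(nxt)
        min_dist = np.minimum(min_dist, np.linalg.norm(tokens - tokens[nxt], axis=1))
        min_dist[nxt] = -np.inf
    return np.sort(np.array(chosen))  # keep temporal order


def recover_full_length(kept, num_frames, rng):
    """Expand (f, D) kept tokens back to (T, D) with single-head cross-attention.

    Queries are random here; in a trained model they would be learned (assumption).
    """
    dim = kept.shape[1]
    queries = rng.standard_normal((num_frames, dim)) * 0.02
    scores = queries @ kept.T / np.sqrt(dim)      # (T, f)
    scores -= scores.max(axis=1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)  # softmax over kept tokens
    return weights @ kept                          # (T, D)


rng = np.random.default_rng(0)
T, D, f = 16, 8, 4                       # hypothetical sizes
tokens = rng.standard_normal((T, D))     # one pose token per frame
idx = select_representative_frames(tokens, f)
kept = tokens[idx]                       # intermediate blocks would run on these
recovered = recover_full_length(kept, T, rng)
print(kept.shape, recovered.shape)
```

The hourglass shape is visible in the token counts: `T` tokens enter, only `f` are processed in the middle, and `T` tokens come out for per-frame pose regression.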
Taxonomy
Topics: Human Pose and Action Recognition · Hand Gesture Recognition Systems · Video Surveillance and Tracking Methods
Methods: Pruning
