Encoder-decoder with Multi-level Attention for 3D Human Shape and Pose Estimation
Ziniu Wan, Zhengjia Li, Maoqing Tian, Jianbo Liu, Shuai Yi, Hongsheng Li

TL;DR
This paper introduces a Multi-level Attention Encoder-Decoder Network (MAED) that effectively models spatial, temporal, and joint-level relations to improve 3D human shape and pose estimation, especially in challenging scenarios.
Contribution
The novel MAED framework integrates multi-level attention mechanisms within an encoder-decoder architecture for enhanced 3D human pose estimation.
Findings
Outperforms previous methods on 3DPW, MPI-INF-3DHP, and Human3.6M benchmarks.
Improves PA-MPJPE over previous methods by 6.2, 7.2, and 2.4 mm on 3DPW, MPI-INF-3DHP, and Human3.6M, respectively.
Effectively models multi-level relations for accurate pose estimation.
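For readers unfamiliar with the metric in the findings above, PA-MPJPE is the mean per-joint position error measured after rigidly aligning the prediction to the ground truth (Procrustes alignment: translation, rotation, and uniform scale). The following is a minimal NumPy sketch of the standard definition, not the authors' evaluation code; the function names `mpjpe` and `pa_mpjpe` and the (J, 3) input layout are assumptions made for illustration.

```python
import numpy as np


def mpjpe(pred, gt):
    """Mean per-joint position error between two (J, 3) joint arrays."""
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    return np.linalg.norm(pred - gt, axis=1).mean()


def pa_mpjpe(pred, gt):
    """MPJPE after a similarity (Procrustes) alignment of pred onto gt.

    pred, gt: (J, 3) arrays in the same units (e.g. mm). Hypothetical helper.
    """
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)

    # Remove translation by centering both joint sets.
    mu_p, mu_g = pred.mean(axis=0), gt.mean(axis=0)
    p, g = pred - mu_p, gt - mu_g

    # Optimal rotation from the SVD of the cross-covariance (Kabsch/Umeyama).
    U, S, Vt = np.linalg.svd(p.T @ g)
    # Flip the last axis if needed so the result is a rotation, not a reflection.
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    D = np.diag([1.0, 1.0, d])
    R = Vt.T @ D @ U.T

    # Optimal uniform scale for the rotated prediction.
    scale = (S * np.diag(D)).sum() / (p ** 2).sum()

    aligned = scale * p @ R.T + mu_g
    return np.linalg.norm(aligned - gt, axis=1).mean()
```

Because the alignment removes global rotation, translation, and scale, PA-MPJPE isolates errors in the articulated pose itself, so it is typically lower than plain MPJPE on the same prediction.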
Abstract
3D human shape and pose estimation is the essential task for human motion analysis, which is widely used in many 3D applications. However, existing methods cannot simultaneously capture the relations at multiple levels, including spatial-temporal level and human joint level. Therefore they fail to make accurate predictions in some hard scenarios when there is cluttered background, occlusion, or extreme pose. To this end, we propose Multi-level Attention Encoder-Decoder Network (MAED), including a Spatial-Temporal Encoder (STE) and a Kinematic Topology Decoder (KTD) to model multi-level attentions in a unified framework. STE consists of a series of cascaded blocks based on Multi-Head Self-Attention, and each block uses two parallel branches to learn spatial and temporal attention respectively. Meanwhile, KTD aims at modeling the joint level attention. It regards pose estimation as a…
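The two-branch idea in the abstract (one branch attending across spatial tokens within a frame, the other attending across frames at a fixed spatial location) can be illustrated with a small NumPy sketch. This is not the authors' STE: it uses a single attention head instead of Multi-Head Self-Attention, the (T, N, C) token layout is an assumption, and the fusion step (a residual connection plus the average of the two branches) is a placeholder chosen for illustration, since the summary does not say how the branches are combined. All function and parameter names are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(x, wq, wk, wv):
    """Single-head scaled dot-product self-attention.

    x: (..., L, C). Attention runs over the L axis; leading axes are batched.
    """
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v


def parallel_st_block(x, params):
    """One block with parallel spatial and temporal attention branches.

    x: (T, N, C) with T frames and N spatial tokens per frame (assumed layout).
    params: dict with "spatial" and "temporal" entries, each (wq, wk, wv).
    """
    # Spatial branch: tokens attend to other tokens in the same frame.
    spatial = self_attention(x, *params["spatial"])

    # Temporal branch: each spatial location attends across frames.
    xt = np.swapaxes(x, 0, 1)                                # (N, T, C)
    temporal = self_attention(xt, *params["temporal"])
    temporal = np.swapaxes(temporal, 0, 1)                   # back to (T, N, C)

    # Assumed fusion: residual connection plus the mean of both branches.
    return x + 0.5 * (spatial + temporal)
```

Cascading several such blocks, as the abstract describes for STE, would amount to feeding each block's output into the next; the output shape matches the input shape, so the blocks compose directly.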
Taxonomy
Topics: Human Pose and Action Recognition · Hand Gesture Recognition Systems · Video Surveillance and Tracking Methods
