Dual-stream Spatio-Temporal GCN-Transformer Network for 3D Human Pose Estimation
Jiawen Duan, Jian Xiang, Zhiqiang Li, Linlin Xue, Wan Xiang

TL;DR
The paper introduces MixTGFormer, a dual-stream GCN-Transformer network that models both local and global spatio-temporal relationships for 3D human pose estimation, achieving state-of-the-art results.
Contribution
It proposes a novel dual-stream architecture combining GCN and Transformer with Mixformer blocks and SE layers for improved 3D pose estimation.
Findings
Achieves a state-of-the-art P1 error of 37.6 mm on Human3.6M.
Achieves a state-of-the-art P1 error of 15.7 mm on MPI-INF-3DHP.
Effectively fuses local skeletal and global features through dual streams.
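The errors above are reported in millimetres. Assuming "P1" refers to the usual Protocol 1 metric on Human3.6M, the mean per-joint position error (MPJPE), it can be sketched as follows. The function name `mpjpe` and the toy poses are illustrative only and are not taken from the paper.

```python
import math

def mpjpe(pred, target):
    """Mean per-joint position error: the average Euclidean distance
    between each predicted joint and its ground-truth counterpart."""
    if len(pred) != len(target) or not pred:
        raise ValueError("pred and target must be non-empty and equally long")
    return sum(math.dist(p, t) for p, t in zip(pred, target)) / len(pred)

# Toy 3-joint pose in millimetres (made-up numbers, not from the paper).
target = [(0.0, 0.0, 0.0), (100.0, 0.0, 0.0), (100.0, 200.0, 0.0)]
pred = [(3.0, 4.0, 0.0), (100.0, 0.0, 0.0), (100.0, 200.0, 12.0)]

# Per-joint distances are 5, 0 and 12 mm, so the mean is 17/3 mm.
error_mm = mpjpe(pred, target)
```

In practice the metric is averaged over all frames and joints of a test set; the single-pose version here only shows the per-joint computation.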
Abstract
3D human pose estimation is a classic and important research direction in the field of computer vision. In recent years, Transformer-based methods have made significant progress in lifting 2D poses to 3D human pose estimation. However, these methods primarily focus on modeling global temporal and spatial relationships, neglecting local skeletal relationships and the information interaction between different channels. Therefore, we propose a novel method, the Dual-stream Spatio-temporal GCN-Transformer Network (MixTGFormer). This method models the spatial and temporal relationships of human skeletons simultaneously through two parallel channels, achieving effective fusion of global and local features. The core of MixTGFormer is composed of stacked Mixformers. Specifically, the Mixformer includes the Mixformer Block and the Squeeze-and-Excitation Layer (SE Layer). It first extracts and…
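The abstract is cut off before it explains how the SE Layer is applied, so the following is only a generic sketch of a Squeeze-and-Excitation step on a set of feature tokens, not the authors' implementation. The function name `se_layer`, the weight matrices `w1` and `w2`, and the toy values are all assumptions made for illustration.

```python
import math

def se_layer(x, w1, w2):
    """Generic Squeeze-and-Excitation over channels.

    x  : list of N tokens, each a list of C channel values
    w1 : r x C weight matrix (channel reduction), hypothetical
    w2 : C x r weight matrix (channel expansion), hypothetical
    Returns the rescaled tokens and the per-channel gates.
    """
    n, c = len(x), len(x[0])
    # Squeeze: average every channel over all tokens.
    z = [sum(tok[ch] for tok in x) / n for ch in range(c)]
    # Excitation: fully connected -> ReLU -> fully connected -> sigmoid.
    h = [max(0.0, sum(w * v for w, v in zip(row, z))) for row in w1]
    gates = [1.0 / (1.0 + math.exp(-sum(w * v for w, v in zip(row, h))))
             for row in w2]
    # Scale: reweight each channel of each token by its gate.
    out = [[tok[ch] * gates[ch] for ch in range(c)] for tok in x]
    return out, gates

# Toy input: 2 tokens with 2 channels, reduced to 1 hidden unit.
x = [[1.0, 2.0], [3.0, 4.0]]
w1 = [[0.5, 0.5]]
w2 = [[1.0], [-1.0]]
out, gates = se_layer(x, w1, w2)
```

Each gate lies strictly between 0 and 1, so the layer can only attenuate channels relative to one another; this is the mechanism by which SE-style layers let information from all channels influence how each one is weighted.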
