Dual-stream Spatio-Temporal GCN-Transformer Network for 3D Human Pose Estimation

Jiawen Duan; Jian Xiang; Zhiqiang Li; Linlin Xue; Wan Xiang

arXiv:2604.17688 · cs.CV · April 21, 2026

TL;DR

The paper introduces MixTGFormer, a dual-stream GCN-Transformer network that models both local and global spatio-temporal relationships for 3D human pose estimation, achieving state-of-the-art results.
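"Local skeletal relationships" can be made concrete with a graph convolution over the skeleton, where each joint aggregates features only from itself and the joints it shares a bone with. The sketch below is a generic GCN step using the common symmetric normalization D^{-1/2}(A+I)D^{-1/2}. It is not the paper's layer. The 5-joint toy skeleton, the function names and the plain-list matrices are assumptions made for illustration only.

```python
import math

# Hypothetical 5-joint toy skeleton (NOT the Human3.6M joint layout):
# 0 = pelvis, 1 = spine, 2 = head, 3 = left hip, 4 = right hip
EDGES = [(0, 1), (1, 2), (0, 3), (0, 4)]
NUM_JOINTS = 5


def normalized_adjacency(num_joints, edges):
    """Return D^{-1/2} (A + I) D^{-1/2} for an undirected skeleton graph."""
    # Start from the identity so every joint keeps its own features (self-loops).
    a = [[1.0 if i == j else 0.0 for j in range(num_joints)] for i in range(num_joints)]
    for i, j in edges:
        a[i][j] = a[j][i] = 1.0
    deg = [sum(row) for row in a]
    return [[a[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(num_joints)]
            for i in range(num_joints)]


def matmul(x, y):
    """Plain-list matrix product of x (n x k) and y (k x m)."""
    return [[sum(x[i][k] * y[k][j] for k in range(len(y))) for j in range(len(y[0]))]
            for i in range(len(x))]


def graph_conv(features, adj_norm, weight):
    """One GCN step: project features, then mix them only along skeleton edges."""
    return matmul(adj_norm, matmul(features, weight))
```

Because the head (joint 2) is connected only to the spine, changing a hip's features leaves the head's output unchanged. A global self-attention layer, by contrast, lets every joint attend to every other joint.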

Contribution

It proposes a novel dual-stream architecture that combines GCN and Transformer components and is built from stacked Mixformers, each consisting of a Mixformer Block and a Squeeze-and-Excitation (SE) layer, to improve 3D pose estimation.
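The SE layer follows the general Squeeze-and-Excitation recipe. It pools each channel over all positions, passes the pooled vector through a small bottleneck MLP, and uses sigmoid outputs to rescale the channels. This is one way to model interaction between feature channels. Below is a minimal pure-Python sketch of the generic technique. It assumes features arrive as a list of per-joint (or per-frame) vectors. The weight layout, the reduction ratio and the function names are hypothetical, and the sketch does not show where the paper places the layer inside a Mixformer.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def linear(vec, weight, bias):
    """y = W x + b, with `weight` given as a list of output rows (hypothetical layout)."""
    return [sum(w * x for w, x in zip(row, vec)) + b for row, b in zip(weight, bias)]


def se_layer(tokens, w1, b1, w2, b2):
    """Generic Squeeze-and-Excitation over channels.

    tokens: list of N feature vectors (e.g. joints or frames), each of length C.
    w1, b1: bottleneck projection C -> C // r (r is an assumed reduction ratio).
    w2, b2: expansion back to C.
    Returns (rescaled tokens, per-channel gates).
    """
    num_tokens = len(tokens)
    channels = len(tokens[0])
    # Squeeze: global average pooling over all tokens, one statistic per channel.
    squeezed = [sum(t[c] for t in tokens) / num_tokens for c in range(channels)]
    # Excitation: bottleneck MLP with ReLU, then sigmoid gates in (0, 1).
    hidden = [max(0.0, h) for h in linear(squeezed, w1, b1)]
    gates = [sigmoid(g) for g in linear(hidden, w2, b2)]
    # Scale: every token's channels are reweighted by the same shared gates.
    return [[t[c] * gates[c] for c in range(channels)] for t in tokens], gates
```

Because the gates are computed from statistics pooled over all tokens, each channel's weight depends on the whole input, while the rescaling itself costs only one multiply per feature.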

Findings

01

Achieved a state-of-the-art P1 error of 37.6 mm on Human3.6M.

02

Achieved a state-of-the-art P1 error of 15.7 mm on MPI-INF-3DHP.

03

Fuses local skeletal features and global features through two parallel streams.
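On Human3.6M, P1 conventionally denotes Protocol #1, the Mean Per-Joint Position Error (MPJPE). MPJPE is the Euclidean distance between predicted and ground-truth joints, averaged over joints (and frames) and reported in millimeters, typically after both poses are made root-relative. The sketch below illustrates that metric only. `mpjpe` and `root_relative` are illustrative names, and the joint set, root index and frame sampling are assumptions not taken from the paper.

```python
import math


def root_relative(joints, root=0):
    """Express joints relative to the root joint (index `root` is an assumption)."""
    rx, ry, rz = joints[root]
    return [(x - rx, y - ry, z - rz) for x, y, z in joints]


def mpjpe(pred, target):
    """Mean Per-Joint Position Error over aligned (x, y, z) joint lists, in input units."""
    if len(pred) != len(target) or not pred:
        raise ValueError("pred and target must be non-empty and aligned")
    return sum(math.dist(p, t) for p, t in zip(pred, target)) / len(pred)
```

For example, one joint predicted exactly and one joint off by a 30-40-0 offset gives errors of 0 mm and 50 mm, so the MPJPE is 25 mm. Making both poses root-relative first removes any global translation from the score.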

Abstract

3D human pose estimation is a classic and important research direction in computer vision. In recent years, Transformer-based methods have made significant progress in lifting 2D human poses to 3D. However, these methods primarily focus on modeling global temporal and spatial relationships, neglecting local skeletal relationships and the information interaction between different channels. We therefore propose a novel method, the Dual-stream Spatio-Temporal GCN-Transformer Network (MixTGFormer). This method models the spatial and temporal relationships of human skeletons simultaneously through two parallel streams, achieving effective fusion of global and local features. The core of MixTGFormer consists of stacked Mixformers. Specifically, each Mixformer includes a Mixformer Block and a Squeeze-and-Excitation Layer (SE Layer). It first extracts and…
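The abstract describes two parallel streams whose outputs are fused, combining global and local features. Because the abstract is truncated, the sketch below only illustrates that general idea over the joint axis of a single frame. A global stream (single-head self-attention with identity query/key/value projections) and a local stream (averaging each joint with its bone neighbors) run on the same input and are combined with a fixed weight `alpha`. The temporal axis, learned projections, the SE layer and the paper's actual fusion rule are all omitted, and every name here is hypothetical.

```python
import math


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def global_attention(tokens):
    """Single-head self-attention with identity Q/K/V (an assumption):
    every joint attends to every joint, giving a global receptive field."""
    dim = len(tokens[0])
    out = []
    for q in tokens:
        weights = softmax([sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in tokens])
        out.append([sum(w * v[c] for w, v in zip(weights, tokens)) for c in range(dim)])
    return out


def local_skeleton(tokens, neighbors):
    """GCN-like local stream: average each joint with its bone-connected neighbors."""
    out = []
    for i, t in enumerate(tokens):
        group = [i] + neighbors[i]
        out.append([sum(tokens[j][c] for j in group) / len(group) for c in range(len(t))])
    return out


def dual_stream_block(tokens, neighbors, alpha=0.5):
    """Run both streams in parallel on the same input and fuse them with weight alpha."""
    global_out = global_attention(tokens)
    local_out = local_skeleton(tokens, neighbors)
    return [[alpha * g + (1 - alpha) * l for g, l in zip(g_row, l_row)]
            for g_row, l_row in zip(global_out, local_out)]
```

With `alpha=0.0` the block reduces to the local stream, so a joint's output ignores joints it is not connected to. With `alpha=1.0` it reduces to the global stream, where any joint can influence any other. A learned or input-dependent fusion weight would be a natural replacement for the fixed `alpha`.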
