CrossFormer: Cross Spatio-Temporal Transformer for 3D Human Pose Estimation
Mohammed Hassanin, Abdelwahed Khamiss, Mohammed Bennamoun, Farid Boussaid, and Ibrahim Radwan

TL;DR
CrossFormer introduces a Transformer architecture with cross-joint and cross-frame interaction modules that improve local and global dependency modeling for 3D human pose estimation, achieving state-of-the-art results on Human3.6M and MPI-INF-3DHP.
Contribution
It proposes a new pose estimation Transformer with specialized interaction modules that enhance local and global joint dependencies, advancing the accuracy of 3D human pose estimation.
Findings
Achieved state-of-the-art performance on the Human3.6M and MPI-INF-3DHP datasets.
Boosted performance over PoseFormer, the closest counterpart, by 0.9% with detected 2D poses and 0.3% with ground-truth 2D poses.
Effectively captures subtle changes across frames through novel interaction modules.
Abstract
3D human pose estimation can be handled by encoding the geometric dependencies between the body parts and enforcing the kinematic constraints. Recently, Transformers have been adopted to encode the long-range dependencies between the joints in the spatial and temporal domains. While they have shown excellence in modeling long-range dependencies, studies have noted the need for improving the locality of vision Transformers. In this direction, we propose a novel pose estimation Transformer featuring rich representations of body joints critical for capturing subtle changes across frames (i.e., inter-feature representation). Specifically, through two novel interaction modules, Cross-Joint Interaction and Cross-Frame Interaction, the model explicitly encodes the local and global dependencies between the body joints. The proposed architecture achieved state-of-the-art performance on two popular 3D human…
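To make the spatial and temporal axes in the abstract concrete, here is a minimal sketch of attention applied first across the joints of each frame and then across the frames of each joint. It is an illustration only, not the paper's Cross-Joint Interaction or Cross-Frame Interaction modules, whose internals are not described on this page. The function names, the toy clip, and the use of identity query/key/value projections (no learned weights) are all assumptions made for brevity.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    # Scaled dot-product self-attention with identity Q/K/V projections
    # (a simplification: a real Transformer layer learns these projections).
    dim = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in tokens]
        weights = softmax(scores)
        out.append([sum(w * v[i] for w, v in zip(weights, tokens)) for i in range(dim)])
    return out

def attend_over_joints(clip):
    # clip[f][j] is the feature vector of joint j in frame f.
    # Spatial axis: each joint attends to the other joints of the same frame.
    return [self_attention(frame) for frame in clip]

def attend_over_frames(clip):
    # Temporal axis: each joint attends to itself across all frames.
    num_frames, num_joints = len(clip), len(clip[0])
    per_joint = [
        self_attention([clip[f][j] for f in range(num_frames)])
        for j in range(num_joints)
    ]
    # Transpose back to the clip[f][j] layout.
    return [[per_joint[j][f] for j in range(num_joints)] for f in range(num_frames)]

# Hypothetical toy clip: 3 frames, 4 joints, 2D coordinates per joint.
clip = [[[0.1 * f + 0.01 * j, 0.2 * j] for j in range(4)] for f in range(3)]
spatial = attend_over_joints(clip)
spatio_temporal = attend_over_frames(spatial)
```

Both passes preserve the frames × joints × features layout, so they can be stacked in either order. Because every output here is a softmax-weighted average of the inputs, this sketch can only smooth features; the learned projections and interaction modules of an actual model are what add representational capacity.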
Taxonomy
Topics: Human Pose and Action Recognition · Video Surveillance and Tracking Methods · Hand Gesture Recognition Systems
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Position-Wise Feed-Forward Layer · Dense Connections · Byte Pair Encoding · Label Smoothing · Absolute Position Encodings · Layer Normalization · Residual Connection
