CrossFormer: Cross Spatio-Temporal Transformer for 3D Human Pose Estimation

Mohammed Hassanin, Abdelwahed Khamiss, Mohammed Bennamoun, Farid Boussaid, and Ibrahim Radwan

arXiv:2203.13387·cs.CV·March 28, 2022·1 cites

TL;DR

CrossFormer introduces a novel Transformer architecture with cross-joint and cross-frame interactions, improving local and global dependency modeling for 3D human pose estimation, achieving state-of-the-art results on key datasets.

Contribution

It proposes a new pose estimation Transformer with specialized interaction modules that enhance local and global joint dependencies, advancing the accuracy of 3D human pose estimation.

Findings

01

Achieved state-of-the-art performance on the Human3.6M and MPI-INF-3DHP datasets.

02

Boosted performance by 0.9% and 0.3% over PoseFormer in different settings.

03

Effectively captures subtle changes across frames through novel interaction modules.

Abstract

3D human pose estimation can be handled by encoding the geometric dependencies between the body parts and enforcing the kinematic constraints. Recently, Transformers have been adopted to encode the long-range dependencies between the joints in the spatial and temporal domains. While they have shown excellence in modeling long-range dependencies, studies have noted the need to improve the locality of vision Transformers. In this direction, we propose a novel pose estimation Transformer featuring rich representations of body joints critical for capturing subtle changes across frames (i.e., inter-feature representation). Specifically, through two novel interaction modules, Cross-Joint Interaction and Cross-Frame Interaction, the model explicitly encodes the local and global dependencies between the body joints. The proposed architecture achieved state-of-the-art performance on two popular 3D human…
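To make the spatial and temporal attention idea in the abstract concrete, the following is a minimal NumPy sketch of generic self-attention applied first across joints within each frame and then across frames for each joint. It is an illustration under stated assumptions, not the authors' Cross-Joint or Cross-Frame Interaction modules, whose internals are not described here. The function names, the identity query/key/value projections, and the tensor sizes (9 frames, 17 joints, 32-dimensional embeddings) are all hypothetical choices made for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens):
    """Single-head scaled dot-product self-attention.

    tokens has shape (..., n_tokens, dim). Queries, keys, and values are
    the tokens themselves (identity projections, an assumption made to
    keep the sketch short).
    """
    dim = tokens.shape[-1]
    scores = tokens @ np.swapaxes(tokens, -1, -2) / np.sqrt(dim)
    return softmax(scores, axis=-1) @ tokens

def spatial_then_temporal(x):
    """x has shape (frames, joints, dim)."""
    # Spatial step: within each frame, every joint attends to all joints.
    spatial = x + self_attention(x)                    # (F, J, D)
    # Temporal step: for each joint, every frame attends to all frames.
    per_joint = np.swapaxes(spatial, 0, 1)             # (J, F, D)
    temporal = per_joint + self_attention(per_joint)   # (J, F, D)
    return np.swapaxes(temporal, 0, 1)                 # (F, J, D)

rng = np.random.default_rng(0)
poses = rng.standard_normal((9, 17, 32))  # hypothetical joint embeddings
out = spatial_then_temporal(poses)
print(out.shape)
```

The residual additions mirror the Residual Connection pattern common to Transformer blocks; a real model would also add learned projections, multiple heads, layer normalization, and feed-forward layers.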

Code & Models

Repositories

mfawzy/CrossFormer (PyTorch, official)

Taxonomy

Topics: Human Pose and Action Recognition · Video Surveillance and Tracking Methods · Hand Gesture Recognition Systems

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Position-Wise Feed-Forward Layer · Dense Connections · Byte Pair Encoding · Label Smoothing · Absolute Position Encodings · Layer Normalization · Residual Connection