Graph and Skipped Transformer: Exploiting Spatial and Temporal Modeling Capacities for Efficient 3D Human Pose Estimation

Mengmeng Cui; Kunbo Zhang; Zhenan Sun

arXiv:2407.02990·cs.CV·October 23, 2024

Open Access

TL;DR

This paper introduces G-SFormer, an efficient 3D human pose estimation model that combines a graph network with a skipped transformer to capture spatio-temporal features with reduced redundancy and computational cost.

Contribution

It proposes a global spatio-temporal modeling approach with a data-driven adaptive topology and a skipped transformer, improving efficiency and robustness over existing methods.

Findings

01. Achieves superior accuracy on multiple benchmarks.

02. Uses only around 10% of the parameters of previous methods.

03. Demonstrates robustness to 2D pose detection errors.

Abstract

In recent years, 2D-to-3D pose uplifting in monocular 3D Human Pose Estimation (HPE) has attracted widespread research interest. GNN-based and Transformer-based methods have become mainstream architectures due to their advanced spatial and temporal feature learning capacities. However, existing approaches typically construct joint-wise and frame-wise attention alignments in the spatial and temporal domains, resulting in dense connections that introduce considerable local redundancy and computational overhead. In this paper, we take a global approach to exploit spatio-temporal information and realize efficient 3D HPE with a concise Graph and Skipped Transformer architecture. Specifically, in the Spatial Encoding stage, coarse-grained body parts are deployed to construct a Spatial Graph Network with a fully data-driven adaptive topology, ensuring model flexibility and generalizability…
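The spatial idea in the abstract (coarse body parts connected by a learned, rather than fixed, topology) can be illustrated with a minimal sketch. This is not the authors' implementation: the 17-joint index layout, the five-part grouping in BODY_PARTS, the average pooling, and the row-softmax adjacency are all assumptions made for illustration, and adj_logits merely stands in for a parameter that a real model would train.

```python
import math

# Hypothetical grouping of a 17-joint skeleton into 5 coarse body parts.
# Both the joint indices and the partition are illustrative assumptions,
# not the grouping used in the paper.
BODY_PARTS = {
    "torso": [0, 7, 8, 9, 10],
    "right_leg": [1, 2, 3],
    "left_leg": [4, 5, 6],
    "left_arm": [11, 12, 13],
    "right_arm": [14, 15, 16],
}


def pool_parts(joints_2d):
    """Average the (x, y) joints of each part into one part-level feature."""
    parts = []
    for idx in BODY_PARTS.values():
        xs = [joints_2d[i][0] for i in idx]
        ys = [joints_2d[i][1] for i in idx]
        parts.append([sum(xs) / len(idx), sum(ys) / len(idx)])
    return parts


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def adaptive_graph_layer(part_feats, adj_logits):
    """Mix part features through a dense adjacency built from free logits.

    adj_logits is a P x P list of floats standing in for a trainable
    parameter; no skeleton prior constrains which parts are connected.
    """
    adj = [softmax(row) for row in adj_logits]
    n, dim = len(part_feats), len(part_feats[0])
    return [
        [sum(adj[i][j] * part_feats[j][d] for j in range(n)) for d in range(dim)]
        for i in range(n)
    ]


if __name__ == "__main__":
    joints = [[float(i), float(2 * i)] for i in range(17)]  # dummy 2D pose
    parts = pool_parts(joints)
    zero_logits = [[0.0] * 5 for _ in range(5)]  # uniform adjacency
    print(adaptive_graph_layer(parts, zero_logits))
```

With all-zero logits every part attends uniformly to every other part, so each output row is the mean of the part features; training the logits would let the data decide which part-to-part connections matter. Working on 5 parts instead of 17 joints also shrinks the adjacency from 17 x 17 to 5 x 5 entries.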

Taxonomy

Topics: Human Pose and Action Recognition · Gait Recognition and Analysis · Video Surveillance and Tracking Methods

Methods: Attention Is All You Need · Linear Layer · Multi-Head Attention · Softmax · Layer Normalization · Byte Pair Encoding · Label Smoothing · Position-Wise Feed-Forward Layer · Adam · Dense Connections