VTP: Volumetric Transformer for Multi-view Multi-person 3D Pose Estimation

Yuxing Chen; Renshu Gu; Ouhan Huang; Gangyong Jia

arXiv:2205.12602 · cs.CV · August 7, 2023 · 1 citation

Open Access

TL;DR

This paper introduces VTP, a novel volumetric transformer framework for multi-view multi-person 3D human pose estimation that effectively combines 3D volumetric features with transformer models, achieving promising results.

Contribution

VTP is the first to integrate volumetric transformers with multi-view 3D pose estimation, utilizing sparse Sinkhorn attention to reduce memory costs and improve performance.
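To see why memory is the concern, note that dense self-attention over N voxel tokens stores an N×N weight matrix: a hypothetical 16×16×16 grid gives 4,096 tokens and about 16.8 million weights per head. Sparse Sinkhorn attention instead splits the sequence into blocks and learns a soft reordering of those blocks, relaxed into an approximately doubly stochastic matrix by Sinkhorn normalization. The sketch below shows only that normalization step, in NumPy; the function names, the score matrix, and the iteration count are illustrative assumptions and are not taken from the VTP implementation.

```python
import numpy as np


def _logsumexp(x, axis):
    """Numerically stable log(sum(exp(x))) along one axis, keeping dims."""
    m = x.max(axis=axis, keepdims=True)
    return m + np.log(np.exp(x - m).sum(axis=axis, keepdims=True))


def sinkhorn_normalize(scores, n_iters=50):
    """Alternate row and column normalization of exp(scores) in log space.

    The result approaches a doubly stochastic matrix (rows and columns
    each sum to 1), which acts as a soft permutation over blocks.
    """
    log_p = np.asarray(scores, dtype=np.float64)
    for _ in range(n_iters):
        log_p = log_p - _logsumexp(log_p, axis=1)  # rows sum to 1
        log_p = log_p - _logsumexp(log_p, axis=0)  # columns sum to 1
    return np.exp(log_p)


# Hypothetical block-to-block sorting scores for 3 blocks (illustrative only).
scores = np.array([
    [0.0, 1.0, 2.0],
    [1.0, 0.5, 0.0],
    [2.0, 0.0, 1.0],
])
soft_perm = sinkhorn_normalize(scores)
```

Each row of `soft_perm` can be read as a soft assignment of one block to the block it should attend to, so attention is computed within and between matched blocks rather than across all token pairs.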

Findings

01

Achieved state-of-the-art results on multiple benchmarks.

02

Reduced memory usage with sparse Sinkhorn attention.

03

Demonstrated effective 3D pose estimation with transformer-based volumetric features.

Abstract

This paper presents Volumetric Transformer Pose estimator (VTP), the first 3D volumetric transformer framework for multi-view multi-person 3D human pose estimation. VTP aggregates features from 2D keypoints in all camera views and directly learns the spatial relationships in the 3D voxel space in an end-to-end fashion. The aggregated 3D features are passed through 3D convolutions before being flattened into sequential embeddings and fed into a transformer. A residual structure is designed to further improve the performance. In addition, sparse Sinkhorn attention is employed to reduce the memory cost, which is a major bottleneck for volumetric representations, while also achieving excellent performance. The output of the transformer is again concatenated with 3D convolutional features by a residual design. The proposed VTP framework integrates the high performance of the transformer…
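The data flow in the abstract (3D convolutional features, flattened into a token sequence, passed through attention, then concatenated back with the convolutional features) can be traced shape by shape. The following is a minimal NumPy sketch under stated assumptions: the tensor sizes are hypothetical, random arrays stand in for the 3D-convolution output and learned weights, and a single dense self-attention layer replaces the paper's transformer with sparse Sinkhorn attention. It illustrates the shapes involved, not the authors' architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: C channels over a D x H x W voxel grid.
C, D, H, W = 8, 4, 4, 4

# Stand-in for the output of the 3D convolutions (random, for illustration).
conv_feat = rng.normal(size=(C, D, H, W))

# Flatten the volume into a sequence: one C-dimensional token per voxel.
tokens = conv_feat.reshape(C, -1).T  # shape (D*H*W, C)


def self_attention(x, w_q, w_k, w_v):
    """Single-head dense scaled dot-product self-attention."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    logits = q @ k.T / np.sqrt(q.shape[-1])
    logits = logits - logits.max(axis=-1, keepdims=True)  # stable softmax
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v


# Random projection matrices stand in for learned parameters.
w_q, w_k, w_v = (rng.normal(scale=C ** -0.5, size=(C, C)) for _ in range(3))
attended = self_attention(tokens, w_q, w_k, w_v)  # shape (D*H*W, C)

# Restore the volumetric layout, then concatenate with the convolutional
# features along the channel axis (the residual-style fusion).
trans_feat = attended.T.reshape(C, D, H, W)
fused = np.concatenate([conv_feat, trans_feat], axis=0)  # (2C, D, H, W)
```

Because `tokens` has one row per voxel, the dense attention matrix here is 64×64; that quadratic growth in the number of voxels is the memory cost the abstract attributes to volumetric representations.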

Taxonomy

Topics: Human Pose and Action Recognition · Diabetic Foot Ulcer Assessment and Management · Video Surveillance and Tracking Methods

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Softmax · Layer Normalization · Byte Pair Encoding · Dense Connections · Dropout · Absolute Position Encodings