VTP: Volumetric Transformer for Multi-view Multi-person 3D Pose Estimation
Yuxing Chen, Renshu Gu, Ouhan Huang, Gangyong Jia

TL;DR
This paper introduces VTP, a volumetric transformer framework for multi-view multi-person 3D human pose estimation. It feeds aggregated 3D voxel features into a transformer and uses sparse Sinkhorn attention to keep the memory cost manageable.
Contribution
VTP is presented as the first 3D volumetric transformer framework for multi-view multi-person 3D pose estimation. It uses sparse Sinkhorn attention to reduce memory cost while maintaining strong performance.
Findings
Achieved state-of-the-art results on multiple benchmarks.
Reduced memory usage with sparse Sinkhorn attention.
Demonstrated effective 3D pose estimation with transformer-based volumetric features.
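To see why attention memory is a concern for volumetric features, a rough count of attention-score entries helps. The sketch below compares dense attention over every voxel token with attention restricted to fixed-size blocks of tokens. The grid size, block size, and function names are hypothetical and chosen only for illustration; this is not the paper's accounting, and sparse Sinkhorn attention additionally lets blocks attend to other sorted blocks.

```python
from math import prod


def full_attention_entries(grid):
    """Entries in one dense attention score matrix over all voxel tokens."""
    n_tokens = prod(grid)
    return n_tokens * n_tokens


def block_local_entries(grid, block_size):
    """Entries when each token attends only within its own block of tokens."""
    n_tokens = prod(grid)
    if n_tokens % block_size != 0:
        raise ValueError("block_size must divide the number of tokens")
    # (n_tokens / block_size) blocks, each with block_size ** 2 entries
    return n_tokens * block_size


grid = (16, 16, 8)  # hypothetical voxel grid, not taken from the paper
dense = full_attention_entries(grid)
local = block_local_entries(grid, block_size=64)
print(dense, local, dense // local)
```

Under these assumed sizes the dense score matrix is 32 times larger than the block-local one, and the gap widens as the grid grows because the dense cost is quadratic in the number of voxels.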
Abstract
This paper presents the Volumetric Transformer Pose estimator (VTP), the first 3D volumetric transformer framework for multi-view multi-person 3D human pose estimation. VTP aggregates features from 2D keypoints in all camera views and directly learns the spatial relationships in the 3D voxel space in an end-to-end fashion. The aggregated 3D features are passed through 3D convolutions before being flattened into sequential embeddings and fed into a transformer. A residual structure is designed to further improve the performance. In addition, sparse Sinkhorn attention is employed to reduce the memory cost, which is a major bottleneck for volumetric representations, while also achieving excellent performance. The output of the transformer is again concatenated with 3D convolutional features by a residual design. The proposed VTP framework integrates the high performance of the transformer…
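Sparse Sinkhorn attention, as named in the abstract, builds on Sinkhorn normalization, which turns a matrix of scores into an approximately doubly stochastic matrix by alternately normalizing rows and columns. The following is a minimal pure-Python sketch of that normalization step only, under the assumption of a small square score matrix. The function name `sinkhorn` and the example scores are hypothetical, and this is not the VTP implementation.

```python
import math


def sinkhorn(scores, n_iters=50):
    """Alternate row and column normalization of exp(scores).

    Returns a square matrix whose rows and columns each sum to roughly 1.
    """
    n = len(scores)
    # Exponentiate so every entry is strictly positive.
    m = [[math.exp(v) for v in row] for row in scores]
    for _ in range(n_iters):
        # Row normalization: each row sums to 1.
        m = [[v / sum(row) for v in row] for row in m]
        # Column normalization: each column sums to 1.
        col_sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
        m = [[m[i][j] / col_sums[j] for j in range(n)] for i in range(n)]
    return m


# Hypothetical block-to-block scores for three blocks of voxel tokens.
scores = [
    [2.0, 0.5, 0.1],
    [0.3, 1.5, 0.2],
    [0.1, 0.4, 1.8],
]
P = sinkhorn(scores)
for row in P:
    print([round(v, 3) for v in row])
```

A matrix of this kind can be read as a soft assignment between blocks of tokens, so attention can be computed within matched blocks rather than across the whole flattened volume.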
Taxonomy
Topics: Human Pose and Action Recognition · Diabetic Foot Ulcer Assessment and Management · Video Surveillance and Tracking Methods
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Softmax · Layer Normalization · Byte Pair Encoding · Dense Connections · Dropout · Absolute Position Encodings
