Adaptive Multi-view and Temporal Fusing Transformer for 3D Human Pose Estimation
Hui Shuai, Lele Wu, and Qingshan Liu

TL;DR
This paper introduces the MTF-Transformer, a unified transformer-based framework that adaptively fuses multi-view and temporal information for 3D human pose estimation without camera calibration, demonstrating competitive results across multiple datasets.
Contribution
It presents a novel adaptive multi-view and temporal fusion transformer architecture that handles varying views and video lengths without camera calibration, improving robustness and generalization.
Findings
Achieves competitive results on the Human3.6M, TotalCapture, and KTH Multiview Football II datasets.
Effectively handles arbitrary view numbers and video lengths.
Demonstrates strong generalization to unseen views and dynamic scenarios.
Abstract
This paper proposes a unified framework dubbed Multi-view and Temporal Fusing Transformer (MTF-Transformer) to adaptively handle varying view numbers and video length without camera calibration in 3D Human Pose Estimation (HPE). It consists of Feature Extractor, Multi-view Fusing Transformer (MFT), and Temporal Fusing Transformer (TFT). Feature Extractor estimates 2D pose from each image and fuses the prediction according to the confidence. It provides pose-focused feature embedding and makes subsequent modules computationally lightweight. MFT fuses the features of a varying number of views with a novel Relative-Attention block. It adaptively measures the implicit relative relationship between each pair of views and reconstructs more informative features. TFT aggregates the features of the whole sequence and predicts 3D pose via a transformer. It adaptively deals with the video of…
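The abstract says MFT fuses features from a varying number of views. The sketch below illustrates only that general idea, using plain scaled dot-product self-attention across views. It is not the paper's Relative-Attention block. The function names `softmax` and `fuse_views`, the feature layout (one vector per view), and the absence of learned projections are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_views(features):
    """Fuse per-view feature vectors with plain self-attention (hypothetical helper).

    features: list of V vectors, each of length D (one vector per camera view).
    Returns V fused vectors of length D. Nothing here depends on V, so the
    same code handles one view, two views, or many.
    """
    dim = len(features[0])
    scale = 1.0 / math.sqrt(dim)
    fused = []
    for query in features:
        # Similarity of this view to every view (including itself).
        scores = [scale * sum(q * k for q, k in zip(query, key)) for key in features]
        weights = softmax(scores)
        # Weighted average of all view features.
        fused.append(
            [sum(w * view[d] for w, view in zip(weights, features)) for d in range(dim)]
        )
    return fused


if __name__ == "__main__":
    two_views = fuse_views([[1.0, 0.0], [0.0, 1.0]])
    four_views = fuse_views([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]])
    print(len(two_views), len(four_views))
```

Because the attention weights are computed per pair of views at run time, no parameter is tied to a fixed view count. A real model would add learned query, key, and value projections, and the paper's block additionally models the relative relationship between each pair of views, which this sketch omits.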
Taxonomy
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Residual Connection · Absolute Position Encodings · Adam · Softmax · Dropout · Dense Connections · Layer Normalization
