IVT: An End-to-End Instance-guided Video Transformer for 3D Pose Estimation
Zhongwei Qiu, Qiansheng Yang, Jian Wang, Dongmei Fu

TL;DR
This paper introduces IVT, an end-to-end video transformer that directly predicts 3D human poses from video frames by learning spatiotemporal depth information through instance-guided tokens and attention mechanisms.
Contribution
An end-to-end framework that learns contextual depth information directly from visual features, rather than from intermediate 2D poses, and handles multiple persons with a cross-scale attention mechanism.
Findings
Achieves state-of-the-art results on three benchmarks.
Effectively models spatiotemporal depth information.
Handles multiple persons with cross-scale attention.
Abstract
Video 3D human pose estimation aims to localize the 3D coordinates of human joints from videos. Recent transformer-based approaches focus on capturing spatiotemporal information from sequential 2D poses, which cannot model contextual depth features effectively because the visual depth features are lost in the 2D pose estimation step. In this paper, we simplify the paradigm into an end-to-end framework, Instance-guided Video Transformer (IVT), which learns spatiotemporal contextual depth information from visual features effectively and predicts 3D poses directly from video frames. In particular, we first formulate video frames as a series of instance-guided tokens, where each token is in charge of predicting the 3D pose of a human instance. These tokens contain body structure information since they are extracted by the guidance of joint offsets from the human center to…
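The idea of an instance-guided token can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `instance_token` and `video_tokens`, the integer pixel offsets, the nearest-pixel sampling, and the concatenation of sampled features are all assumptions made for illustration. It only shows how features gathered at locations given by a person's center plus per-joint offsets could form one token per frame.

```python
import numpy as np


def instance_token(features, center, joint_offsets):
    """Build one token by sampling features at center + offset locations.

    features:      (H, W, C) feature map of a single frame
    center:        (2,) integer (row, col) of the person's center
    joint_offsets: (J, 2) integer offsets from the center (hypothetical format)
    returns:       (J * C,) flattened token
    """
    h, w, _ = features.shape
    points = np.asarray(center) + np.asarray(joint_offsets)
    # Clamp sampling locations so they stay inside the feature map.
    rows = np.clip(points[:, 0], 0, h - 1)
    cols = np.clip(points[:, 1], 0, w - 1)
    # Gather one C-dimensional feature per offset and concatenate them.
    return features[rows, cols].reshape(-1)


def video_tokens(frames, centers, offsets):
    """Stack one token per frame into a (T, J * C) sequence for one person."""
    return np.stack(
        [instance_token(f, c, o) for f, c, o in zip(frames, centers, offsets)]
    )
```

A sequence built this way has one row per frame, which is the shape a temporal attention layer would typically consume. How IVT actually samples, fuses, and attends over these tokens is not specified in the truncated abstract above.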
Taxonomy
Topics: Human Pose and Action Recognition · Hand Gesture Recognition Systems · Anomaly Detection Techniques and Applications
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Dense Connections · Softmax · Adam · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Label Smoothing · Layer Normalization
