End-to-End Multi-Person Pose Estimation with Pose-Aware Video Transformer
Yonghui Yu, Jiahang Cai, Xun Wang, and Wenwu Yang

TL;DR
This paper introduces PAVE-Net, an end-to-end video transformer framework for multi-person pose estimation that eliminates heuristic steps and improves accuracy and efficiency in associating individuals across video frames.
Contribution
The paper proposes the first end-to-end multi-frame 2D human pose estimation method using a novel pose-aware attention mechanism within a video transformer architecture.
Findings
Outperforms prior image-based end-to-end methods by 6.0 mAP on PoseTrack2017.
Offers accuracy comparable to state-of-the-art two-stage approaches with better efficiency.
Abstract
Existing multi-person video pose estimation methods typically adopt a two-stage pipeline: detecting individuals in each frame, followed by temporal modeling for single person pose estimation. This design relies on heuristic operations such as detection, RoI cropping, and non-maximum suppression (NMS), limiting both accuracy and efficiency. In this paper, we present a fully end-to-end framework for multi-person 2D pose estimation in videos, effectively eliminating heuristic operations. A key challenge is to associate individuals across frames under complex and overlapping temporal trajectories. To address this, we introduce a novel Pose-Aware Video transformEr Network (PAVE-Net), which features a spatial encoder to model intra-frame relations and a spatiotemporal pose decoder to capture global dependencies across frames. To achieve accurate temporal association, we propose a pose-aware…
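For readers unfamiliar with the heuristic steps the abstract refers to, the following is a minimal sketch of greedy non-maximum suppression (NMS) as it is commonly used in two-stage detection pipelines. It is not part of PAVE-Net and is not taken from the paper; the function names `iou` and `greedy_nms`, the box format `(x1, y1, x2, y2)`, and the default threshold of 0.5 are assumptions chosen for illustration.

```python
from typing import List, Tuple

# Assumed box format for this sketch: (x1, y1, x2, y2) with x2 > x1 and y2 > y1.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(boxes: List[Box], scores: List[float],
               iou_threshold: float = 0.5) -> List[int]:
    """Return indices of the boxes kept, in descending score order.

    A box is kept only if it does not overlap any already-kept,
    higher-scoring box by more than iou_threshold (a hand-tuned value).
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


# Two heavily overlapping person boxes and one separate box.
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(greedy_nms(boxes, scores))  # the second box is suppressed
```

The fixed overlap threshold illustrates why such a step counts as a heuristic: in this example the second box is discarded purely because it overlaps the first, which is the same situation that arises when two people genuinely stand close together.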
Taxonomy
Topics: Human Pose and Action Recognition · Robot Manipulation and Learning · Human Motion and Animation
