End-to-End Multi-Person Pose Estimation with Pose-Aware Video Transformer

Yonghui Yu, Jiahang Cai, Xun Wang, and Wenwu Yang

arXiv:2511.13208·cs.CV·December 3, 2025

TL;DR

This paper introduces PAVE-Net, an end-to-end video transformer framework for multi-person pose estimation that eliminates heuristic steps and improves accuracy and efficiency in associating individuals across video frames.

Contribution

The paper proposes the first end-to-end multi-frame 2D human pose estimation method using a novel pose-aware attention mechanism within a video transformer architecture.

Findings

01. Achieves a 6.0 mAP improvement on PoseTrack2017.
02. Outperforms prior image-based end-to-end methods.
03. Offers accuracy comparable to state-of-the-art two-stage approaches with better efficiency.

Abstract

Existing multi-person video pose estimation methods typically adopt a two-stage pipeline: detecting individuals in each frame, followed by temporal modeling for single-person pose estimation. This design relies on heuristic operations such as detection, RoI cropping, and non-maximum suppression (NMS), limiting both accuracy and efficiency. In this paper, we present a fully end-to-end framework for multi-person 2D pose estimation in videos, effectively eliminating heuristic operations. A key challenge is to associate individuals across frames under complex and overlapping temporal trajectories. To address this, we introduce a novel Pose-Aware Video transformEr Network (PAVE-Net), which features a spatial encoder to model intra-frame relations and a spatiotemporal pose decoder to capture global dependencies across frames. To achieve accurate temporal association, we propose a pose-aware…
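
For readers unfamiliar with the NMS heuristic that the abstract names as part of two-stage pipelines, the following is a minimal sketch of generic greedy box NMS. It is not code from the paper: the function names `iou` and `greedy_nms`, the `(x1, y1, x2, y2)` box layout, and the 0.5 threshold are illustrative assumptions.

```python
from typing import List, Tuple

# Assumed box layout for this sketch: (x1, y1, x2, y2), with x2 >= x1 and y2 >= y1.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(boxes: List[Box], scores: List[float],
               iou_threshold: float = 0.5) -> List[int]:
    """Return indices of kept boxes, highest score first.

    A box is kept only if its IoU with every already-kept box
    does not exceed the (hand-chosen) threshold.
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


# Two heavily overlapping detections and one separate detection.
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = greedy_nms(boxes, scores)  # the second box is suppressed by the first
```

The threshold is a hand-tuned parameter, and a low value can suppress a genuinely distinct person who overlaps another in the frame, which is one reason such steps are described as heuristic.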

Taxonomy

Topics: Human Pose and Action Recognition · Robot Manipulation and Learning · Human Motion and Animation