Beyond Static Frames: Temporal Aggregate-and-Restore Vision Transformer for Human Pose Estimation
Hongwei Fang, Jiahang Cai, Xun Wang, Wenwu Yang

TL;DR
This paper introduces TAR-ViTPose, a novel video-based human pose estimation method that leverages temporal aggregation and restoration in Vision Transformers to improve accuracy and stability over static image methods.
Contribution
The paper proposes joint-centric temporal aggregation and a global restoring attention mechanism to enhance ViT-based pose estimation in videos, addressing the lack of temporal coherence in frame-by-frame predictions.
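This summary does not spell out how the aggregation or restoring steps are implemented, so the following is only a generic sketch of the aggregate-then-restore idea, not the authors' method. It assumes plain scaled dot-product attention, one hypothetical query vector per joint, and a residual update of the current frame's patch tokens. The names `attend`, `aggregate_and_restore`, and `joint_queries` are illustrative and do not come from the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys, values):
    """Scaled dot-product attention on plain lists of vectors."""
    scale = math.sqrt(len(keys[0]))
    out_dim = len(values[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        outputs.append(
            [sum(w * v[d] for w, v in zip(weights, values)) for d in range(out_dim)]
        )
    return outputs


def aggregate_and_restore(frames, joint_queries, current_index):
    """Illustrative two-step update (an assumption, not the paper's design).

    frames: list of frames, each a list of patch-token vectors.
    joint_queries: one hypothetical query vector per body joint.
    current_index: index of the frame whose tokens are refined.
    """
    # Aggregate: each joint query gathers evidence from the tokens of all frames.
    all_tokens = [token for frame in frames for token in frame]
    joint_tokens = attend(joint_queries, all_tokens, all_tokens)

    # Restore: current-frame tokens read the aggregated joint tokens back,
    # and the result is added as a residual to the original tokens.
    current = frames[current_index]
    updates = attend(current, joint_tokens, joint_tokens)
    return [[c + u for c, u in zip(tok, upd)] for tok, upd in zip(current, updates)]


# Toy input: 3 frames, 4 patch tokens per frame, 2-dimensional features.
frames = [[[0.1 * (f + p), 0.2 * (f - p)] for p in range(4)] for f in range(3)]
joint_queries = [[1.0, 0.0], [0.0, 1.0]]  # two hypothetical joints
refined = aggregate_and_restore(frames, joint_queries, current_index=1)
```

The refined tokens keep the shape of the current frame's tokens, which is what would let a step like this sit in front of an unchanged single-frame pose head. Whether TAR-ViTPose is wired this way is not stated in this summary.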
Findings
Achieves a +2.3 mAP gain on the PoseTrack2017 benchmark.
Outperforms existing state-of-the-art video pose methods.
Provides a higher runtime frame rate in real-world scenarios.
Abstract
Vision Transformers (ViTs) have recently achieved state-of-the-art performance in 2D human pose estimation due to their strong global modeling capability. However, existing ViT-based pose estimators are designed for static images and process each frame independently, thereby ignoring the temporal coherence that exists in video sequences. This limitation often results in unstable predictions, especially in challenging scenes involving motion blur, occlusion, or defocus. In this paper, we propose TAR-ViTPose, a novel Temporal Aggregate-and-Restore Vision Transformer tailored for video-based 2D human pose estimation. TAR-ViTPose enhances static ViT representations by aggregating temporal cues across frames in a plug-and-play manner, leading to more robust and accurate pose estimation. To effectively aggregate joint-specific features that are temporally aligned across frames, we introduce a…
Taxonomy
Topics: Human Pose and Action Recognition · Robot Manipulation and Learning · Advanced Vision and Imaging
