Beyond Static Frames: Temporal Aggregate-and-Restore Vision Transformer for Human Pose Estimation

Hongwei Fang; Jiahang Cai; Xun Wang; Wenwu Yang

arXiv:2603.05929 · cs.CV · March 9, 2026

TL;DR

This paper introduces TAR-ViTPose, a novel video-based human pose estimation method that leverages temporal aggregation and restoration in Vision Transformers to improve accuracy and stability over static image methods.

Contribution

The paper proposes joint-centric temporal aggregation together with a global restoring attention mechanism to enhance ViT-based pose estimation in videos, addressing the lack of temporal coherence in frame-by-frame prediction.

Findings

01. Achieves +2.3 mAP on the PoseTrack2017 benchmark.

02. Outperforms existing state-of-the-art video pose methods.

03. Provides a higher runtime frame rate in real-world scenarios.

Abstract

Vision Transformers (ViTs) have recently achieved state-of-the-art performance in 2D human pose estimation due to their strong global modeling capability. However, existing ViT-based pose estimators are designed for static images and process each frame independently, thereby ignoring the temporal coherence that exists in video sequences. This limitation often results in unstable predictions, especially in challenging scenes involving motion blur, occlusion, or defocus. In this paper, we propose TAR-ViTPose, a novel Temporal Aggregate-and-Restore Vision Transformer tailored for video-based 2D human pose estimation. TAR-ViTPose enhances static ViT representations by aggregating temporal cues across frames in a plug-and-play manner, leading to more robust and accurate pose estimation. To effectively aggregate joint-specific features that are temporally aligned across frames, we introduce a…
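To make the aggregate-then-restore idea concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the abstract is truncated before it names the actual mechanism, so everything below is an assumption made for illustration. In particular, the function names (`attend`, `aggregate_and_restore`), the use of one query vector per joint, the residual addition, and the omission of learned projections, multiple heads, and any explicit temporal alignment are all hypothetical simplifications.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]


def attend(queries, keys, values):
    """Scaled dot-product attention without learned projections.

    Each query returns a softmax-weighted average of `values`.
    """
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        w = softmax(scores)
        out.append([sum(wi * v[j] for wi, v in zip(w, values))
                    for j in range(len(values[0]))])
    return out


def aggregate_and_restore(frames, joint_queries, center):
    """Hypothetical aggregate-then-restore step (illustrative assumption only).

    frames:        T frames, each a list of N patch tokens of dimension D.
    joint_queries: J query vectors, one per joint (assumed, not from the paper).
    center:        index of the frame whose tokens are enhanced.
    """
    # Aggregate: each joint query pools patch tokens from every frame.
    all_tokens = [tok for frame in frames for tok in frame]
    joint_tokens = attend(joint_queries, all_tokens, all_tokens)

    # Restore: the current frame's patch tokens read the pooled joint tokens
    # back; a residual add keeps the original static-image features.
    current = frames[center]
    update = attend(current, joint_tokens, joint_tokens)
    return [[c + u for c, u in zip(tok, upd)]
            for tok, upd in zip(current, update)]


# Toy usage: 3 frames, 4 patch tokens per frame, 8-dim features, 2 joints.
random.seed(0)
T, N, D, J = 3, 4, 8, 2
frames = [[[random.gauss(0, 1) for _ in range(D)] for _ in range(N)]
          for _ in range(T)]
joint_queries = [[random.gauss(0, 1) for _ in range(D)] for _ in range(J)]
enhanced = aggregate_and_restore(frames, joint_queries, center=1)
print(len(enhanced), len(enhanced[0]))
```

The enhanced tokens keep the shape of the current frame's tokens, which is one way a module like this could stay plug-and-play with respect to an existing static ViT backbone and its decoder; whether TAR-ViTPose is structured this way is not stated in the text above.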

Taxonomy

Topics: Human Pose and Action Recognition · Robot Manipulation and Learning · Advanced Vision and Imaging