HRPVT: High-Resolution Pyramid Vision Transformer for medium and small-scale human pose estimation

Zhoujie Xu

arXiv:2410.22079 · cs.CV · December 17, 2024

TL;DR

HRPVT introduces a novel high-resolution pyramid module combined with a transformer backbone and a new keypoint representation method, significantly improving medium and small-scale human pose estimation accuracy and efficiency.

Contribution

The paper presents HRPVT, a transformer-based model with a high-resolution pyramid module and a new keypoint representation, advancing pose estimation for small and medium figures.

Findings

01. Enhanced accuracy on small and medium human pose datasets.

02. Reduced computational cost compared to traditional heatmap methods.

03. Effective modeling of long-range dependencies with PVT v2 backbone.

Abstract

Human pose estimation on medium and small scales has long been a significant challenge in this field. Most existing methods focus on restoring high-resolution feature maps by stacking multiple costly deconvolutional layers or by continuously aggregating semantic information from low-resolution feature maps while maintaining high-resolution ones, which can lead to information redundancy. Additionally, due to quantization errors, heatmap-based methods have certain disadvantages in accurately locating keypoints of medium and small-scale human figures. In this paper, we propose HRPVT, which utilizes PVT v2 as the backbone to model long-range dependencies. Building on this, we introduce the High-Resolution Pyramid Module (HRPM), designed to generate higher quality high-resolution representations by incorporating the intrinsic inductive biases of Convolutional Neural Networks (CNNs) into the…
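
The quantization error that the abstract attributes to heatmap-based methods can be illustrated with a small sketch. It is not HRPVT's keypoint representation, which the truncated abstract does not describe. It assumes a generic heatmap pipeline with plain argmax decoding; the function name `heatmap_decode_error`, the stride of 4, the 64x48 heatmap size, and the Gaussian sigma are all illustrative choices rather than values taken from the paper.

```python
import numpy as np

def heatmap_decode_error(x_true, y_true, stride=4, heatmap_size=(64, 48), sigma=2.0):
    """Encode a keypoint as a Gaussian heatmap, decode it with argmax,
    and return the localization error in input-image pixels.

    All parameters are hypothetical defaults for illustration only.
    """
    h, w = heatmap_size
    # Continuous keypoint position expressed in heatmap coordinates.
    cx, cy = x_true / stride, y_true / stride
    ys, xs = np.mgrid[0:h, 0:w]
    heatmap = np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))
    # Argmax decoding can only return an integer heatmap cell.
    iy, ix = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    # Map the cell back to input-image pixels.
    x_pred, y_pred = ix * stride, iy * stride
    return float(np.hypot(x_pred - x_true, y_pred - y_true))

# A keypoint that falls between heatmap cells cannot be recovered exactly.
err = heatmap_decode_error(101, 51)
for person_height in (256, 32):
    print(f"height {person_height}px: error {err:.2f}px "
          f"= {100 * err / person_height:.2f}% of figure height")
```

Under these assumptions the absolute error is the same regardless of figure size, about 1.41 pixels for the keypoint at (101, 51), but it amounts to roughly 0.55% of a 256-pixel-tall figure and roughly 4.4% of a 32-pixel-tall one. Practical heatmap pipelines often add sub-pixel refinement steps that reduce this error, which the sketch deliberately omits.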

Taxonomy

Topics: Human Pose and Action Recognition · Video Surveillance and Tracking Methods · Gait Recognition and Analysis

Methods: Attention Is All You Need · Softmax · Linear Layer · Layer Normalization · Residual Connection · Absolute Position Encodings · Multi-Head Attention · Spatial-Reduction Attention · Dense Connections · Pyramid Vision Transformer