ViTPose++: Vision Transformer for Generic Body Pose Estimation

Yufei Xu; Jing Zhang; Qiming Zhang; Dacheng Tao

arXiv:2212.04246 · cs.CV · December 15, 2023 · 6 cites

TL;DR

ViTPose++ demonstrates that plain vision transformers are highly effective and flexible for generic body pose estimation, achieving state-of-the-art results across multiple benchmarks with scalable model sizes and transfer learning capabilities.

Contribution

Introduces ViTPose++, a simple yet powerful framework for body pose estimation built on plain vision transformers, together with a ViTPose+ model for heterogeneous keypoint tasks and methods for transferring knowledge between models.

Findings

01 ViTPose outperforms existing methods on the MS COCO benchmark.

02 ViTPose+ achieves state-of-the-art results on multiple pose estimation datasets.

03 Large ViTPose models can effectively transfer knowledge to smaller models.
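
The knowledge-transfer finding above can be illustrated with a generic distillation objective. This is a minimal sketch, not the authors' method: it assumes the small model is trained to match both the large model's predicted heatmap and the ground-truth heatmap using mean squared error. The function names and the `alpha` weighting are hypothetical and do not come from this page.

```python
from typing import Sequence


def mse(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean squared error between two equally sized, flattened heatmaps."""
    if len(a) != len(b):
        raise ValueError("heatmaps must have the same number of elements")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def distillation_loss(
    student: Sequence[float],
    teacher: Sequence[float],
    target: Sequence[float],
    alpha: float = 0.5,  # hypothetical weighting, an assumption of this sketch
) -> float:
    """Blend 'imitate the teacher' with 'fit the ground truth'.

    alpha = 1.0 trains purely on the teacher's output;
    alpha = 0.0 ignores the teacher entirely.
    """
    return alpha * mse(student, teacher) + (1.0 - alpha) * mse(student, target)


# Toy flattened heatmaps with two cells each.
loss = distillation_loss(
    student=[0.0, 1.0],
    teacher=[0.0, 0.0],
    target=[0.0, 1.0],
)
```

In this toy case the student already matches the ground truth, so the only remaining penalty comes from disagreeing with the teacher.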

Abstract

In this paper, we show the surprisingly good properties of plain vision transformers for body pose estimation from various aspects, namely simplicity in model structure, scalability in model size, flexibility in training paradigm, and transferability of knowledge between models, through a simple baseline model dubbed ViTPose. Specifically, ViTPose employs the plain and non-hierarchical vision transformer as an encoder to encode features and a lightweight decoder to decode body keypoints in either a top-down or a bottom-up manner. It can be scaled up from about 20M to 1B parameters by taking advantage of the scalable model capacity and high parallelism of the vision transformer, setting a new Pareto front for throughput and performance. Besides, ViTPose is very flexible regarding the attention type, input resolution, and pre-training and fine-tuning strategy. Based on the flexibility, a…
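
The encoder and decoder pipeline described in the abstract can be made concrete with a small sketch of two steps: counting the patch tokens a plain vision transformer sees for one person crop, and turning per-keypoint heatmaps into coordinates in the top-down setting. Everything numeric here is an assumption for illustration (a patch size of 16, a heatmap stride of 4, and simple argmax decoding); none of it is taken from this page, and the helper names are hypothetical.

```python
from typing import List, Tuple

Heatmap = List[List[float]]  # H rows, each with W columns


def num_patch_tokens(height: int, width: int, patch_size: int = 16) -> int:
    """Number of non-overlapping patches (tokens) for an input crop.

    patch_size=16 is an assumed value, not stated in the source.
    """
    return (height // patch_size) * (width // patch_size)


def decode_keypoints(
    heatmaps: List[Heatmap], stride: int = 4
) -> List[Tuple[int, int, float]]:
    """Pick the highest-scoring cell of each heatmap.

    Returns one (x, y, score) tuple per keypoint, with x and y scaled
    back to crop pixels by the assumed heatmap stride.
    """
    keypoints = []
    for heatmap in heatmaps:
        best_score = float("-inf")
        best_x = best_y = 0
        for y, row in enumerate(heatmap):
            for x, value in enumerate(row):
                if value > best_score:
                    best_score, best_x, best_y = value, x, y
        keypoints.append((best_x * stride, best_y * stride, best_score))
    return keypoints


# One toy 2x2 heatmap whose peak is at column 0, row 1.
toy_heatmaps = [[[0.0, 0.1], [0.9, 0.2]]]
toy_keypoints = decode_keypoints(toy_heatmaps)
tokens = num_patch_tokens(256, 192)
```

Real decoders usually refine the argmax location to sub-pixel precision; plain argmax is used here only to keep the sketch short.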

Taxonomy

Topics: Human Pose and Action Recognition · Video Surveillance and Tracking Methods · Advanced Neural Network Applications

Methods: Multi-Head Attention · Attention Is All You Need · Softmax · Linear Layer · Dense Connections · Residual Connection · Layer Normalization · Vision Transformer