ViTPose++: Vision Transformer for Generic Body Pose Estimation
Yufei Xu, Jing Zhang, Qiming Zhang, Dacheng Tao

TL;DR
ViTPose++ demonstrates that plain vision transformers are highly effective and flexible for generic body pose estimation, achieving state-of-the-art results across multiple benchmarks with scalable model sizes and transfer learning capabilities.
Contribution
Introduces ViTPose++, a simple yet powerful vision transformer-based framework for body pose estimation, including a novel ViTPose+ model for heterogeneous keypoint tasks and methods for transferring knowledge between models.
Findings
ViTPose outperforms existing methods on the MS COCO benchmark.
ViTPose+ achieves state-of-the-art results on multiple pose estimation datasets.
Large ViTPose models can transfer knowledge effectively to smaller models.
Abstract
In this paper, we show the surprisingly good properties of plain vision transformers for body pose estimation from various aspects, namely simplicity in model structure, scalability in model size, flexibility in training paradigm, and transferability of knowledge between models, through a simple baseline model dubbed ViTPose. Specifically, ViTPose employs the plain and non-hierarchical vision transformer as an encoder to encode features and a lightweight decoder to decode body keypoints in either a top-down or a bottom-up manner. It can be scaled up from about 20M to 1B parameters by taking advantage of the scalable model capacity and high parallelism of the vision transformer, setting a new Pareto front for throughput and performance. Besides, ViTPose is very flexible regarding the attention type, input resolution, and pre-training and fine-tuning strategy. Based on the flexibility, a…
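To give a feel for the "about 20M to 1B parameters" scaling range mentioned in the abstract, the following is a rough sketch that counts the parameters in the transformer blocks of a plain ViT encoder. The function name `vit_encoder_params` and the (embedding width, depth) pairs are assumptions taken from commonly used ViT-S/B/L/H configurations, not values stated on this page; the patch embedding, position embeddings, and the pose decoder are deliberately left out, so the totals are only approximate.

```python
def vit_encoder_params(embed_dim: int, depth: int, mlp_ratio: int = 4) -> int:
    """Approximate parameter count of the transformer blocks in a plain ViT.

    Per block this counts the fused QKV projection, the attention output
    projection, the two MLP layers (all with biases), and two LayerNorms.
    Patch embedding, position embeddings, and any decoder are excluded.
    """
    d = embed_dim
    hidden = mlp_ratio * d
    qkv = 3 * d * d + 3 * d                    # fused query/key/value weights + biases
    proj = d * d + d                           # attention output projection
    mlp = d * hidden + hidden + hidden * d + d  # two linear layers of the MLP
    norms = 2 * (2 * d)                        # two LayerNorms, each with scale and shift
    return depth * (qkv + proj + mlp + norms)


# Hypothetical size table: standard ViT configurations, assumed for illustration.
ASSUMED_CONFIGS = {
    "small": (384, 12),
    "base": (768, 12),
    "large": (1024, 24),
    "huge": (1280, 32),
}

if __name__ == "__main__":
    for name, (dim, depth) in ASSUMED_CONFIGS.items():
        millions = vit_encoder_params(dim, depth) / 1e6
        print(f"{name:>5}: ~{millions:.0f}M encoder parameters")
```

Because the count grows roughly as 12 × depth × width², widening and deepening the same plain block is enough to move from tens of millions to hundreds of millions of parameters without any architectural change, which is the kind of scalability the abstract highlights.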
Code & Models
- 🤗 usyd-community/vitpose-plus-small · model · 17k dl · ♡ 5
- 🤗 usyd-community/vitpose-plus-large · model · 13k dl · ♡ 3
- 🤗 usyd-community/vitpose-plus-huge · model · 40k dl · ♡ 15
- 🤗 amorrissette/vitpose-plus-small · model
- 🤗 amorrissette/vitpose-plus-large · model
- 🤗 amorrissette/vitpose-plus-huge · model
- 🤗 aayuks/vitpose-plus-small · model · 22 dl
Taxonomy
Topics: Human Pose and Action Recognition · Video Surveillance and Tracking Methods · Advanced Neural Network Applications
Methods: Multi-Head Attention · Attention Is All You Need · Softmax · Linear Layer · Dense Connections · Residual Connection · Layer Normalization · Vision Transformer
