SMPLer-X: Scaling Up Expressive Human Pose and Shape Estimation
Zhongang Cai, Wanqi Yin, Ailing Zeng, Chen Wei, Qingping Sun, Yanjun Wang, Hui En Pang, Haiyi Mei, Mingyuan Zhang, Lei Zhang, Chen Change Loy, Lei Yang, Ziwei Liu

TL;DR
SMPLer-X is a large-scale foundation model for expressive human pose and shape estimation, trained on up to 4.5M instances from diverse datasets with backbones up to ViT-Huge. It achieves state-of-the-art results across multiple benchmarks and transfers well to unseen environments.
Contribution
This work introduces SMPLer-X, the first generalist foundation model for expressive human pose and shape estimation (EHPS), which scales up both the training data and the vision transformer backbone to improve performance and transferability.
Findings
SMPLer-X achieves state-of-the-art results on seven benchmarks.
Data and model scaling significantly enhance EHPS capabilities.
Finetuning further boosts model performance.
Abstract
Expressive human pose and shape estimation (EHPS) unifies body, hands, and face motion capture with numerous applications. Despite encouraging progress, current state-of-the-art methods still depend largely on a confined set of training datasets. In this work, we investigate scaling up EHPS towards the first generalist foundation model (dubbed SMPLer-X), with up to ViT-Huge as the backbone and training with up to 4.5M instances from diverse data sources. With big data and the large model, SMPLer-X exhibits strong performance across diverse test benchmarks and excellent transferability to even unseen environments. 1) For the data scaling, we perform a systematic investigation on 32 EHPS datasets, including a wide range of scenarios that a model trained on any single dataset cannot handle. More importantly, capitalizing on insights obtained from the extensive benchmarking process, we…
Taxonomy
Topics: Human Pose and Action Recognition · Video Surveillance and Tracking Methods · Hand Gesture Recognition Systems
