TL;DR
This paper studies whether 3D MoCap data can improve human mesh recovery from images and videos. It shows that fine-tuning image-based models on synthetic renderings of MoCap poses, and adding a new transformer module (PoseBERT) for video, both improve performance.
Contribution
It introduces a method to leverage MoCap data for training image-based models and proposes PoseBERT, a transformer for video-based pose estimation, achieving state-of-the-art results.
Findings
Fine-tuning with MoCap data improves image-based model performance.
PoseBERT effectively incorporates temporal information for video pose estimation.
Proposed methods outperform existing approaches on multiple datasets.
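The abstract states that PoseBERT is trained via masked modeling. As a rough illustration of that idea only, the sketch below hides a random subset of frames in a pose sequence so that a model could be trained to reconstruct them from the surrounding frames. The function name `mask_pose_sequence`, the masking ratio, and the zero mask value are all hypothetical choices made for this example; the summary does not specify the masking scheme PoseBERT actually uses.

```python
import random


def mask_pose_sequence(poses, mask_ratio=0.25, mask_value=0.0, seed=None):
    """Replace a random subset of frames with a placeholder value.

    poses: list of frames, each frame a list of pose parameters (floats).
    Returns (corrupted_sequence, sorted_masked_frame_indices).
    All defaults are illustrative assumptions, not values from the paper.
    """
    rng = random.Random(seed)
    num_frames = len(poses)
    # Mask at least one frame, but never more frames than exist.
    num_masked = min(num_frames, max(1, round(mask_ratio * num_frames)))
    masked_idx = sorted(rng.sample(range(num_frames), num_masked))
    masked_set = set(masked_idx)
    corrupted = [
        [mask_value] * len(frame) if t in masked_set else list(frame)
        for t, frame in enumerate(poses)
    ]
    return corrupted, masked_idx


# Toy sequence: 8 frames with 3 pose parameters each.
poses = [[t + 1.0, t + 2.0, t + 3.0] for t in range(8)]
corrupted, masked_idx = mask_pose_sequence(poses, mask_ratio=0.25, seed=0)
```

In a training loop, the model would receive `corrupted` as input and be penalized on how well it recovers the original `poses`, typically at the positions listed in `masked_idx`.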
Abstract
Training state-of-the-art models for human body pose and shape recovery from images or videos requires datasets with corresponding annotations that are hard and expensive to obtain. Our goal in this paper is to study whether poses from 3D Motion Capture (MoCap) data can be used to improve image-based and video-based human mesh recovery methods. We find that fine-tuning image-based models with synthetic renderings from MoCap data can increase their performance by providing them with a wider variety of poses, textures and backgrounds. In fact, we show that simply fine-tuning the batch normalization layers of the model is enough to achieve large gains. We further study the use of MoCap data for video, and introduce PoseBERT, a transformer module that directly regresses the pose parameters and is trained via masked modeling. It is simple, generic and can be plugged on top of any…
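The claim that fine-tuning only the batch normalization layers is enough can be made concrete with a minimal PyTorch sketch. It freezes every parameter and then re-enables gradients for the batch normalization layers alone. The helper name `freeze_all_but_batchnorm` and the toy network are assumptions made for illustration; they stand in for whatever image-based mesh recovery model is being fine-tuned and are not the authors' code.

```python
import torch
from torch import nn


def freeze_all_but_batchnorm(model: nn.Module) -> nn.Module:
    """Leave only batch normalization parameters trainable (illustrative helper)."""
    # First freeze every parameter in the model.
    for param in model.parameters():
        param.requires_grad_(False)
    # Then re-enable the affine weight and bias of each batch norm layer.
    for module in model.modules():
        if isinstance(module, nn.modules.batchnorm._BatchNorm):
            for param in module.parameters():
                param.requires_grad_(True)
    return model


# Toy stand-in for a pretrained backbone (hypothetical architecture).
toy = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3),
    nn.BatchNorm2d(8),
    nn.ReLU(),
    nn.Conv2d(8, 4, kernel_size=3),
)
freeze_all_but_batchnorm(toy)

trainable = [name for name, p in toy.named_parameters() if p.requires_grad]

# Hand only the trainable parameters to the optimizer.
optimizer = torch.optim.SGD(
    [p for p in toy.parameters() if p.requires_grad], lr=1e-3
)
```

Note that while the model is in training mode, the batch normalization running statistics also adapt to the new data, independently of the `requires_grad` flags.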
Taxonomy
Methods: Test · Batch Normalization
