TL;DR
This paper introduces HMR-ViT, a novel approach combining temporal and kinematic information via a Vision Transformer for improved human mesh recovery from videos.
Contribution
According to the authors, HMR-ViT is the first method to use both temporal and kinematic cues for human mesh recovery, combining them through a Vision Transformer.
Findings
Achieves competitive results on 3DPW and Human3.6M datasets.
Uses a Channel Rearranging Matrix (CRM) to place similar kinematic features spatially close together in the Temporal-kinematic Feature Image.
Demonstrates the effectiveness of combining temporal and kinematic information.
Abstract
Human Mesh Recovery (HMR) from an image is a challenging problem because of the inherent ambiguity of the task. Existing HMR methods have used either temporal information or kinematic relationships to achieve higher accuracy, but no method uses both. Hence, we propose "Video Inference for Human Mesh Recovery with Vision Transformer (HMR-ViT)", which takes both temporal and kinematic information into account. In HMR-ViT, a Temporal-kinematic Feature Image is constructed from feature vectors that an image encoder extracts from video frames. When generating the feature image, we use a Channel Rearranging Matrix (CRM) so that similar kinematic features are located spatially close together. The feature image is then further encoded using a Vision Transformer, and the SMPL pose and shape parameters are finally inferred using a regression network. Extensive evaluation on the 3DPW…
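The feature-image construction described in the abstract can be illustrated with a small sketch. It assumes, purely for illustration, that the per-frame feature vectors are stacked as rows (time on one axis, channels on the other) and that the CRM acts as a fixed permutation of the channel axis; the abstract does not state the actual layout or how the CRM is obtained. The names `permutation_matrix` and `build_feature_image` are hypothetical and do not come from the paper.

```python
def permutation_matrix(order):
    """Build a C x C matrix whose column j selects input channel order[j].

    Assumption: the CRM is modeled here as a plain permutation matrix.
    """
    size = len(order)
    if sorted(order) != list(range(size)):
        raise ValueError("order must be a permutation of 0..C-1")
    matrix = [[0.0] * size for _ in range(size)]
    for new_pos, old_pos in enumerate(order):
        matrix[old_pos][new_pos] = 1.0
    return matrix


def build_feature_image(frame_features, crm):
    """Stack per-frame feature vectors into a T x C feature image.

    Each row is one frame's feature vector multiplied by the CRM, so
    rows index time and columns index the rearranged channels.
    """
    size = len(crm)
    image = []
    for vec in frame_features:
        if len(vec) != size:
            raise ValueError("feature length must match the CRM size")
        row = [sum(vec[i] * crm[i][j] for i in range(size)) for j in range(size)]
        image.append(row)
    return image


# Toy input: 3 frames, 4 feature channels each (stand-in for encoder outputs).
frames = [
    [0.0, 1.0, 2.0, 3.0],
    [4.0, 5.0, 6.0, 7.0],
    [8.0, 9.0, 10.0, 11.0],
]
# Hypothetical ordering that moves channels 2 and 0 next to each other.
order = [2, 0, 3, 1]
image = build_feature_image(frames, permutation_matrix(order))
```

In this sketch, column `j` of `image` holds original channel `order[j]` for every frame, so channels that the ordering treats as related end up in adjacent columns. The resulting 2-D array is the kind of input that could then be split into patches for a Vision Transformer, a step not shown here.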
