Multi-view Human Body Mesh Translator
Xiangjian Jiang, Xuecheng Nie, Zitian Wang, Luoqi Liu, Si Liu

TL;DR
This paper introduces MMT, a multi-view human body mesh translator using vision transformers, which significantly improves mesh recovery accuracy by fusing multi-view features and enforcing geometric consistency.
Contribution
The novel MMT model effectively leverages multi-view images and cross-view alignment to enhance human mesh reconstruction, outperforming existing methods.
Findings
28.8% improvement in MPVE (mean per-vertex error) over the prior state of the art on the HUMBI dataset
Outperforms existing models by a large margin
Produces high-quality human mesh reconstructions
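To make the headline metric concrete, the following is a minimal sketch of how mean per-vertex error and a relative improvement figure are commonly computed. It assumes MPVE is the average Euclidean distance between corresponding predicted and ground-truth vertices; the function names `mpve` and `relative_improvement` are illustrative and do not come from the paper, and the paper's exact evaluation protocol (units, alignment) is not stated here.

```python
import math


def mpve(pred_vertices, gt_vertices):
    """Mean per-vertex error: average Euclidean distance between
    corresponding predicted and ground-truth mesh vertices.

    Both arguments are sequences of (x, y, z) coordinates in the same order.
    """
    if len(pred_vertices) != len(gt_vertices) or len(pred_vertices) == 0:
        raise ValueError("meshes must be non-empty and have the same vertex count")
    total = sum(math.dist(p, g) for p, g in zip(pred_vertices, gt_vertices))
    return total / len(pred_vertices)


def relative_improvement(baseline_error, new_error):
    """Percentage reduction of an error metric relative to a baseline."""
    return 100.0 * (baseline_error - new_error) / baseline_error
```

Under this definition, a "28.8% improvement in MPVE" would mean the new error is 71.2% of the baseline error, assuming the figure is a relative reduction.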
Abstract
Existing methods for human mesh recovery mainly focus on single-view frameworks, but they often fail to produce accurate results due to the ill-posed setup. Considering the maturity of multi-view motion capture systems, in this paper we propose to solve this ill-posed problem by leveraging multiple images from different views, thus significantly enhancing the quality of recovered meshes. In particular, we present a novel Multi-view human body Mesh Translator (MMT) model for estimating human body mesh with the help of a vision transformer. Specifically, MMT takes multi-view images as input and translates them to targeted meshes in a single-forward manner. MMT fuses features of different views in both encoding and decoding phases, leading to representations embedded with global information. Additionally, to ensure the tokens are intensively focused on the…
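The cross-view fusion described in the abstract can be illustrated with a generic attention sketch. This is not the authors' architecture: it assumes plain scaled dot-product self-attention over the concatenated tokens of all views, with identity query/key/value projections and no learned weights, and the name `fuse_views` is hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_views(view_tokens):
    """Fuse per-view token features with one self-attention pass.

    view_tokens: list of views, each a list of token vectors (lists of floats
    of equal length). Tokens from all views are concatenated so that every
    token can attend to tokens from every other view.
    Assumption: identity Q/K/V projections, single head, no learned weights.
    """
    tokens = [tok for view in view_tokens for tok in view]
    dim = len(tokens[0])
    fused = []
    for query in tokens:
        # Scaled dot-product similarity between this token and all tokens.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in tokens
        ]
        weights = softmax(scores)
        # Weighted average of all tokens, mixing information across views.
        fused.append(
            [sum(w * tok[j] for w, tok in zip(weights, tokens)) for j in range(dim)]
        )
    return fused
```

Each output token is a convex combination of tokens from every view, which is one simple way a representation can come to carry global, cross-view information.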
Taxonomy
Topics: Human Pose and Action Recognition · 3D Shape Modeling and Analysis · Video Surveillance and Tracking Methods
