Disambiguating Monocular Reconstruction of 3D Clothed Human with Spatial-Temporal Transformer
Yong Deng, Baoxing Li, Xu Zhao

TL;DR
This paper introduces a Spatial-Temporal Transformer network that improves monocular 3D clothed human reconstruction by capturing global spatial and temporal information, addressing ambiguities caused by the invisible back view, lighting conditions, and body movement.
Contribution
The proposed Spatial-Temporal Transformer (STT) network integrates spatial and temporal transformers to enhance normal map prediction and feature consistency in monocular 3D human reconstruction.
Findings
Outperforms state-of-the-art methods on Adobe and MonoPerfCap datasets.
Maintains robust generalization under low-light outdoor conditions.
Effectively reduces ambiguity in back details and local image variations.
Abstract
Reconstructing 3D clothed humans from monocular camera data is highly challenging due to viewpoint limitations and image ambiguity. While implicit function-based approaches, combined with prior knowledge from parametric models, have made significant progress, there are still two notable problems. Firstly, the back details of human models are ambiguous due to viewpoint invisibility. The quality of the back details depends on the back normal map predicted by a convolutional neural network (CNN). However, the CNN lacks global information awareness for comprehending the back texture, resulting in excessively smooth back details. Secondly, a single image suffers from local ambiguity due to lighting conditions and body movement. However, implicit functions are highly sensitive to pixel variations in ambiguous regions. To address these ambiguities, we propose the Spatial-Temporal Transformer…
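The abstract's argument that a CNN lacks global awareness while a transformer does not can be illustrated with a minimal self-attention sketch. This is a generic single-head example, not the authors' STT network: the function names, the toy `tokens` values, and the use of identity projections (queries, keys, and values all equal to the input) are assumptions made for illustration only.

```python
import math

def softmax(row):
    # Subtract the max for numerical stability before exponentiating.
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    """Single-head self-attention with identity projections (q = k = v = tokens).

    A real transformer layer would apply learned linear projections,
    multiple heads, residual connections, and layer normalization.
    """
    dim = len(tokens[0])
    scale = math.sqrt(dim)
    outputs, weights = [], []
    for q in tokens:
        # Each token scores itself against every token, near or far.
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in tokens]
        w = softmax(scores)
        weights.append(w)
        # The output is a weighted average over all tokens.
        outputs.append([sum(w_j * v[d] for w_j, v in zip(w, tokens))
                        for d in range(dim)])
    return outputs, weights

# Hypothetical toy features: 4 tokens (e.g. image patches), 2 channels each.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
outputs, weights = self_attention(tokens)
```

Every row of `weights` is strictly positive across all tokens, so each output mixes information from the whole input rather than from a local neighborhood as in a convolution. The same operation applied to features of one location across several video frames, instead of across patches of one image, would give a temporal variant; how the paper actually arranges the two is not specified in the text above.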
Taxonomy
Topics: 3D Shape Modeling and Analysis · Textile materials and evaluations · Industrial Vision Systems and Defect Detection
Methods: Attention Is All You Need · Linear Layer · Layer Normalization · Residual Connection · Position-Wise Feed-Forward Layer · Dense Connections · Softmax · Multi-Head Attention · Adam · Dropout
