MTFusion: Reconstructing Any 3D Object from Single Image Using Multi-word Textual Inversion
Yu Liu, Ruowei Wang, Jiaqi Li, Zixiang Xu, Qijun Zhao

TL;DR
MTFusion combines multi-word textual inversion with image data to reconstruct detailed 3D models from a single image, reporting higher fidelity and faster training than existing methods.
Contribution
The paper presents a new multi-word textual inversion technique and an enhanced 3D generation pipeline using FlexiCubes, improving detail capture and training efficiency.
Findings
Outperforms existing methods on synthetic and real images
Achieves higher fidelity in surface and texture details
Faster training due to improved decoder network
Abstract
Reconstructing 3D models from single-view images is a long-standing problem in computer vision. The latest advances for single-image 3D reconstruction extract a textual description from the input image and further utilize it to synthesize 3D models. However, existing methods focus on capturing a single key attribute of the image (e.g., object type, artistic style) and fail to consider the multi-perspective information required for accurate 3D reconstruction, such as object shape and material properties. Besides, the reliance on Neural Radiance Fields hinders their ability to reconstruct intricate surfaces and texture details. In this work, we propose MTFusion, which leverages both image data and textual descriptions for high-fidelity 3D reconstruction. Our approach consists of two stages. First, we adopt a novel multi-word textual inversion technique to extract a detailed text…
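As a rough illustration of the multi-word idea (several learnable pseudo-word embeddings fitted against a frozen model, instead of a single one), here is a toy sketch in pure Python. Everything in it is an assumption made for illustration and not the paper's implementation: the frozen text-to-image model is replaced by fixed random linear maps (`slot_maps`, one per prompt slot such as a hypothetical "shape" word and "material" word), the input image is replaced by a fixed feature vector (`target`), and the helper name `invert_tokens` is invented here.

```python
import random


def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def mat_t_vec(M, v):
    """Multiply the transpose of M by vector v."""
    return [sum(M[i][j] * v[i] for i in range(len(M))) for j in range(len(M[0]))]


def invert_tokens(slot_maps, target, emb_dim, steps=200, lr=0.02, seed=0):
    """Fit one embedding per pseudo-word so their combined (frozen) encoding
    matches the target feature. Only the embeddings are updated; the slot
    maps stay fixed, mirroring the frozen-model setup of textual inversion."""
    rng = random.Random(seed)
    tokens = [[rng.gauss(0, 0.1) for _ in range(emb_dim)] for _ in slot_maps]
    losses = []
    for _ in range(steps):
        # Forward pass: sum the frozen encodings of all pseudo-words.
        pred = [0.0] * len(target)
        for W, e in zip(slot_maps, tokens):
            pred = [p + q for p, q in zip(pred, matvec(W, e))]
        resid = [p - t for p, t in zip(pred, target)]
        losses.append(sum(r * r for r in resid))
        # Backward pass: gradient of the squared error w.r.t. each embedding.
        for k, W in enumerate(slot_maps):
            grad = [2.0 * g for g in mat_t_vec(W, resid)]
            tokens[k] = [e - lr * g for e, g in zip(tokens[k], grad)]
    return tokens, losses


# Hypothetical toy sizes: 2 pseudo-words, 4-dim embeddings, 6-dim features.
EMB_DIM, FEAT_DIM, NUM_WORDS = 4, 6, 2
_rng = random.Random(1)
slot_maps = [
    [[_rng.uniform(-1, 1) for _ in range(EMB_DIM)] for _ in range(FEAT_DIM)]
    for _ in range(NUM_WORDS)
]
target = [_rng.uniform(-1, 1) for _ in range(FEAT_DIM)]

tokens, losses = invert_tokens(slot_maps, target, EMB_DIM)
print(f"loss before: {losses[0]:.4f}, loss after: {losses[-1]:.4f}")
```

The point of the sketch is only the structure: each pseudo-word gets its own embedding and its own gradient, so different words can absorb different aspects of the target, while the model that interprets them is never updated.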
Taxonomy
Methods: ADaptive gradient method with the OPTimal convergence rate · Focus
