Anything-3D: Towards Single-view Anything Reconstruction in the Wild
Qiuhong Shen, Xingyi Yang, Xinchao Wang

TL;DR
Anything-3D introduces a novel framework combining visual-language models and segmentation techniques to enable accurate single-view 3D object reconstruction in diverse real-world scenarios.
Contribution
The paper presents a new method that integrates multiple models for reliable single-view 3D reconstruction, addressing limitations of previous approaches.
Findings
Produces detailed 3D reconstructions for various objects
Demonstrates robustness across diverse datasets
Outperforms existing methods in accuracy
Abstract
3D reconstruction from a single RGB image in unconstrained real-world scenarios presents numerous challenges due to the inherent diversity and complexity of objects and environments. In this paper, we introduce Anything-3D, a methodical framework that ingeniously combines a series of visual-language models and the Segment-Anything object segmentation model to elevate objects to 3D, yielding a reliable and versatile system for the single-view conditioned 3D reconstruction task. Our approach employs a BLIP model to generate textual descriptions, utilizes the Segment-Anything model to extract the objects of interest, and leverages a text-to-image diffusion model to lift the object into a neural radiance field. Demonstrating its ability to produce accurate and detailed 3D reconstructions for a wide array of objects, Anything-3D shows promise in addressing…
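The three stages named in the abstract (captioning, segmentation, diffusion-guided lifting) can be pictured as a simple data flow. The following is a minimal sketch of that flow only, not the authors' implementation: the function names (`reconstruct`, `apply_mask`) and the toy stand-in stages are hypothetical, the image is a plain nested list, and the real system would plug in BLIP, Segment-Anything, and a diffusion-based NeRF optimizer where the stubs are. Running the captioner on the full input image before segmentation is also an assumption taken from the order in which the abstract lists the stages.

```python
from typing import Any, Callable, List

Image = List[List[float]]  # toy grayscale image: rows of pixel intensities
Mask = List[List[bool]]    # per-pixel foreground mask


def apply_mask(image: Image, mask: Mask) -> Image:
    """Keep foreground pixels and zero out the background."""
    return [
        [px if keep else 0.0 for px, keep in zip(row, mask_row)]
        for row, mask_row in zip(image, mask)
    ]


def reconstruct(
    image: Image,
    caption_fn: Callable[[Image], str],      # stand-in for a BLIP captioner
    segment_fn: Callable[[Image], Mask],     # stand-in for Segment-Anything
    lift_fn: Callable[[str, Image], Any],    # stand-in for diffusion-guided NeRF fitting
) -> Any:
    """Chain the three stages: describe, isolate, then lift to 3D."""
    text = caption_fn(image)          # 1. textual description of the object
    mask = segment_fn(image)          # 2. mask of the object of interest
    obj = apply_mask(image, mask)     #    background removed
    return lift_fn(text, obj)         # 3. 3D lifting conditioned on text + object


# Toy stand-ins so the sketch runs without any model weights (all hypothetical).
def toy_caption(image: Image) -> str:
    return "a toy object"


def toy_segment(image: Image) -> Mask:
    # Treat bright pixels as foreground.
    return [[px > 0.5 for px in row] for row in image]


def toy_lift(text: str, obj: Image) -> dict:
    # A real lifting stage would optimize a radiance field; here we only
    # record what it would be conditioned on.
    n_fg = sum(1 for row in obj for px in row if px != 0.0)
    return {"prompt": text, "foreground_pixels": n_fg, "object": obj}


if __name__ == "__main__":
    demo = [[0.1, 0.9], [0.8, 0.2]]
    print(reconstruct(demo, toy_caption, toy_segment, toy_lift))
```

Passing the stages in as callables keeps the sketch independent of any particular model library, so each stub can be replaced separately.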
Taxonomy
Topics: Advanced Neural Network Applications · Multimodal Machine Learning Applications · Human Pose and Action Recognition
Methods: Diffusion · BLIP: Bootstrapping Language-Image Pre-training
