You See it, You Got it: Learning 3D Creation on Pose-Free Videos at Scale
Baorui Ma, Huachen Gao, Haoge Deng, Zhengxiong Luo, Tiejun Huang, Lulu Tang, Xinlong Wang

TL;DR
See3D is a scalable, video-based 3D generation model trained on large-scale internet videos, enabling open-world 3D creation without explicit 3D annotations, outperforming prior models on benchmarks.
Contribution
The paper introduces See3D, a novel 3D generation framework trained on large-scale web videos using a new data curation pipeline and a pose-free visual conditioning method.
Findings
Achieves state-of-the-art zero-shot 3D generation performance.
Trains on 320 million frames from 16 million video clips (the WebVi3D dataset).
Outperforms models trained on traditional 3D datasets.
Abstract
Recent 3D generation models typically rely on limited-scale 3D 'gold-labels' or 2D diffusion priors for 3D content creation. However, their performance is upper-bounded by constrained 3D priors due to the lack of scalable learning paradigms. In this work, we present See3D, a visual-conditional multi-view diffusion model trained on large-scale Internet videos for open-world 3D creation. The model aims to Get 3D knowledge by solely Seeing the visual contents from the vast and rapidly growing video data -- You See it, You Got it. To achieve this, we first scale up the training data using a proposed data curation pipeline that automatically filters out multi-view inconsistencies and insufficient observations from source videos. This results in a high-quality, richly diverse, large-scale dataset of multi-view images, termed WebVi3D, containing 320M frames from 16M video clips. Nevertheless,…
Taxonomy
Topics: Educational Games and Gamification
Methods: Diffusion
