ViewDiff: 3D-Consistent Image Generation with Text-to-Image Models

Lukas Höllein; Aljaž Božič; Norman Müller; David Novotny; Hung-Yu Tseng; Christian Richardt; Michael Zollhöfer; Matthias Nießner

arXiv:2403.01807·cs.CV·July 30, 2024·3 cites

Open Access · 1 Repo

TL;DR

ViewDiff adapts pretrained text-to-image models to 3D-consistent image generation by integrating 3D volume-rendering and cross-frame-attention layers, and is trained on real-world data to produce high-quality multi-view images.

Contribution

The paper presents a new method that leverages pretrained text-to-image models with integrated 3D volume rendering and autoregressive generation for consistent 3D asset creation.
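For readers unfamiliar with the volume rendering mentioned above, the following is a minimal sketch of the standard emission-absorption compositing rule along a single ray. It is not taken from the ViewDiff code: the function name `composite_ray` and its arguments are hypothetical, and the paper's layers may composite learned features rather than RGB colors.

```python
import numpy as np

def composite_ray(sigmas, colors, deltas):
    """Composite samples along one ray (illustrative sketch, not the paper's code).

    sigmas: (S,)   non-negative densities at S samples along the ray
    colors: (S, C) per-sample colors or feature vectors
    deltas: (S,)   distances between consecutive samples
    Returns the composited (C,) vector and the (S,) per-sample weights.
    """
    # Opacity of each sample: alpha_i = 1 - exp(-sigma_i * delta_i)
    alphas = 1.0 - np.exp(-sigmas * deltas)
    # Transmittance reaching sample i: product of (1 - alpha_j) for j < i
    trans = np.concatenate([[1.0], np.cumprod(1.0 - alphas)[:-1]])
    weights = trans * alphas
    return weights @ colors, weights

# Example call with three samples and RGB colors
sigmas = np.array([0.5, 2.0, 4.0])
colors = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
pixel, weights = composite_ray(sigmas, colors, np.full(3, 0.25))
```

Because every step is a differentiable array operation, a rule of this kind can sit inside a network and be trained end to end.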

Findings

01 Produces highly 3D-consistent images from real-world data

02 Achieves 30% lower FID and 37% lower KID scores compared to previous methods

03 Generates diverse high-quality shapes and textures in authentic environments

Abstract

3D asset generation is getting massive amounts of attention, inspired by the recent success of text-guided 2D content creation. Existing text-to-3D methods use pretrained text-to-image diffusion models in an optimization problem or fine-tune them on synthetic data, which often results in non-photorealistic 3D objects without backgrounds. In this paper, we present a method that leverages pretrained text-to-image models as a prior, and learn to generate multi-view images in a single denoising process from real-world data. Concretely, we propose to integrate 3D volume-rendering and cross-frame-attention layers into each block of the existing U-Net network of the text-to-image model. Moreover, we design an autoregressive generation that renders more 3D-consistent images at any viewpoint. We train our model on real-world datasets of objects and showcase its capabilities to generate instances…
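As a rough illustration of the cross-frame-attention idea named in the abstract, the sketch below lets the tokens of every view attend to the tokens of all views jointly. This is one common form of cross-frame attention and an assumption on our part: the function `cross_frame_attention`, the single-head layout, and the projection matrices `w_q`, `w_k`, `w_v` are hypothetical, and the layers in the official repository may differ.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_frame_attention(feats, w_q, w_k, w_v):
    """Single-head attention across views (illustrative sketch).

    feats: (N, T, C) array holding T tokens with C channels for each of N views
    w_q, w_k, w_v: (C, D) projection matrices
    Returns an (N, T, D) array.
    """
    n, t, _ = feats.shape
    q = feats @ w_q                            # (N, T, D) queries per view
    # Keys and values are pooled over all views, so each query sees every frame
    k = (feats @ w_k).reshape(n * t, -1)       # (N*T, D)
    v = (feats @ w_v).reshape(n * t, -1)       # (N*T, D)
    scores = q @ k.T / np.sqrt(q.shape[-1])    # (N, T, N*T)
    weights = softmax(scores, axis=-1)
    return weights @ v                         # (N, T, D)

# Example call: 3 views, 16 tokens each, 8 channels
rng = np.random.default_rng(0)
feats = rng.normal(size=(3, 16, 8))
w_q, w_k, w_v = (rng.normal(size=(8, 8)) for _ in range(3))
out = cross_frame_attention(feats, w_q, w_k, w_v)
```

Pooling keys and values over all views is what distinguishes this from per-image self-attention: each view's output depends on the content of the other views, which is the mechanism that lets a single denoising pass share information between frames.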

Code & Models

Repositories

facebookresearch/viewdiff
PyTorch · Official

Taxonomy

Topics: Image Processing and 3D Reconstruction

Methods: Max Pooling · Diffusion · Concatenated Skip Connection · Convolution · U-Net