Viewpoint Textual Inversion: Discovering Scene Representations and 3D View Control in 2D Diffusion Models

James Burgess, Kuan-Chieh Wang, and Serena Yeung-Levy

arXiv:2309.07986 · cs.CV · July 29, 2024 · 2 citations

TL;DR

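This paper shows that 2D text-to-image diffusion models such as Stable Diffusion encode certain 3D scene representations in their text embedding space. It introduces ViewNeTI, a method that discovers 3D view tokens to control the camera viewpoint of generated images, and applies it to view-controlled generation and novel view synthesis.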

Contribution

We propose ViewNeTI, a neural mapper that discovers 3D view tokens in diffusion models, enabling explicit control of 3D viewpoints in generated images.

Findings

1. The text latent space contains a continuous view-control manifold for particular 3D scenes.

2. There is evidence of a generalized view-control manifold shared across scenes.

3. ViewNeTI achieves state-of-the-art results in view-controlled generation and novel view synthesis.

Abstract

Text-to-image diffusion models generate impressive and realistic images, but do they learn to represent the 3D world from only 2D supervision? We demonstrate that yes, certain 3D scene representations are encoded in the text embedding space of models like Stable Diffusion. Our approach, Viewpoint Neural Textual Inversion (ViewNeTI), is to discover 3D view tokens; these tokens control the 3D viewpoint (the rendering pose in a scene) of generated images. Specifically, we train a small neural mapper to take continuous camera viewpoint parameters and predict a view token (a word embedding). This token conditions diffusion generation via cross-attention to produce images with the desired camera viewpoint. Using ViewNeTI as an evaluation tool, we report two findings: first, the text latent space has a continuous view-control manifold for particular 3D scenes; second, we find evidence for a…
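The mapping the abstract describes, from continuous camera parameters to a view token that conditions generation through cross-attention, can be illustrated with a toy example. This is a minimal, hypothetical sketch in pure Python, not the code in jmhb0/view_neti. It assumes a three-number camera parameterization, an untrained two-layer MLP as the mapper, and a single cross-attention head without learned projections. The names `ViewMapper`, `cross_attention`, `EMBED_DIM`, and `HIDDEN_DIM` are illustrative only, and real text embeddings are far wider than 8 dimensions.

```python
import math
import random

EMBED_DIM = 8     # hypothetical token width; real text encoders use hundreds of dims
HIDDEN_DIM = 16   # hypothetical hidden width of the mapper


def matvec(weights, vec):
    """Multiply a (rows x cols) matrix, stored as a list of rows, by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]


class ViewMapper:
    """Hypothetical two-layer MLP: camera parameters -> one word embedding (the view token).

    Weights are random here; in a real system they would be trained.
    """

    def __init__(self, n_params, seed=0):
        rng = random.Random(seed)
        self.w1 = [[rng.gauss(0, 0.5) for _ in range(n_params)] for _ in range(HIDDEN_DIM)]
        self.w2 = [[rng.gauss(0, 0.5) for _ in range(HIDDEN_DIM)] for _ in range(EMBED_DIM)]

    def __call__(self, camera_params):
        hidden = [max(0.0, h) for h in matvec(self.w1, camera_params)]  # ReLU
        return matvec(self.w2, hidden)


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(query, tokens):
    """Single-head scaled dot-product attention of one image query over text tokens.

    Keys and values are the tokens themselves (no learned projections) to keep the sketch short.
    Returns the attended vector and the attention weights.
    """
    scale = 1.0 / math.sqrt(len(query))
    weights = softmax([scale * sum(q * k for q, k in zip(query, tok)) for tok in tokens])
    attended = [sum(w * tok[d] for w, tok in zip(weights, tokens)) for d in range(len(query))]
    return attended, weights


# Usage: stand-in prompt embeddings plus a view token predicted from a camera pose.
rng = random.Random(1)
prompt = [[rng.gauss(0, 1) for _ in range(EMBED_DIM)] for _ in range(3)]
mapper = ViewMapper(n_params=3)
view_token = mapper([0.3, -1.2, 2.0])  # hypothetical (azimuth, elevation, radius)
tokens = [view_token] + prompt          # the view token sits alongside the prompt tokens
query = [rng.gauss(0, 1) for _ in range(EMBED_DIM)]  # stand-in image-feature query
out, attn = cross_attention(query, tokens)
```

Because the view token is just another entry in the token sequence, the image features can attend to it exactly as they attend to ordinary word embeddings. This is why a token predicted from camera parameters can steer generation without any change to the diffusion model itself.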

Code & Models

Repositories

jmhb0/view_neti (PyTorch, official)

Taxonomy

Topics: Domain Adaptation and Few-Shot Learning · Generative Adversarial Networks and Image Synthesis · Advanced Image and Video Retrieval Techniques

Methods: Diffusion