DT-NVS: Diffusion Transformers for Novel View Synthesis
Wonbong Jang, Jonathan Tremblay, Lourdes Agapito

TL;DR
This paper introduces DT-NVS, a 3D diffusion model with a transformer architecture for generalized novel view synthesis from a single image. It is trained on real-world videos and outperforms existing methods in diversity and quality.
Contribution
The paper presents a novel 3D diffusion model with a transformer backbone, new camera conditioning strategies, and a training paradigm for real-world, unaligned datasets.
Findings
Outperforms state-of-the-art 3D diffusion models.
Generates diverse and high-quality novel views.
Effective on real-world, unaligned video datasets.
Abstract
Generating novel views of a natural scene, e.g., everyday scenes both indoors and outdoors, from a single view is an under-explored problem, even though it is a natural extension of object-centric novel view synthesis. Existing diffusion-based approaches either focus on small camera movements in real scenes or consider only unnatural object-centric scenes, limiting their potential applications in real-world settings. In this paper, we move away from these constrained regimes and propose a 3D diffusion model trained with image-only losses on a large-scale dataset of real-world, multi-category, unaligned, and casually acquired videos of everyday scenes. We propose DT-NVS, a 3D-aware diffusion model for generalized novel view synthesis that exploits a transformer-based architecture backbone. We make significant contributions to transformer and self-attention architectures to translate…
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Advanced Vision and Imaging · Image Enhancement Techniques
