Pixel-Aligned Multi-View Generation with Depth Guided Decoder
Zhenggang Tang, Peiye Zhuang, Chaoyang Wang, Aliaksandr Siarohin, Yash Kant, Alexander Schwing, Sergey Tulyakov, Hsin-Ying Lee

TL;DR
This paper introduces a novel pixel-aligned multi-view generation method that incorporates depth-guided attention in the VAE decoder, improving multi-view consistency and aiding 3D reconstruction from single images.
Contribution
It proposes a depth-truncated epipolar attention mechanism within a latent diffusion framework, enhancing pixel alignment across views and robustness to inaccurate depth during inference.
Findings
Improved pixel alignment across multi-view images.
Enhanced performance in multi-view to 3D reconstruction.
Effective handling of inaccurate depth during inference.
Abstract
The task of image-to-multi-view generation refers to generating novel views of an instance from a single image. Recent methods achieve this by extending text-to-image latent diffusion models to a multi-view version, which contains a VAE image encoder and a U-Net diffusion model. Specifically, these generation methods usually fix the VAE and finetune only the U-Net. However, the significant downscaling of the latent vectors computed from the input images, together with independent decoding, leads to notable pixel-level misalignment across multiple views. To address this, we propose a novel method for pixel-level image-to-multi-view generation. Unlike prior work, we incorporate attention layers across multi-view images in the VAE decoder of a latent video diffusion model. Specifically, we introduce a depth-truncated epipolar attention, enabling the model to focus on spatially adjacent regions while…
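The geometric idea behind a depth-truncated epipolar attention can be illustrated with a small sketch. This is not the authors' implementation: it assumes pinhole cameras with known intrinsics and a known relative pose, and the function name `truncated_epipolar_points` and the parameters `delta` and `num_samples` are hypothetical. Given a pixel in a source view and a (possibly inaccurate) depth estimate, it returns the short segment of the epipolar line in a target view that corresponds to depths near that estimate; an attention layer could then restrict its keys to features sampled at those locations.

```python
import numpy as np

def truncated_epipolar_points(uv, depth, K_src, K_tgt, T_src_to_tgt,
                              delta=0.1, num_samples=8):
    """Sample target-view pixels on the epipolar line of `uv`, truncated
    to depths in [depth - delta, depth + delta].

    Assumptions (illustrative only, not taken from the paper):
    pinhole intrinsics K_src / K_tgt (3x3) and a 4x4 rigid transform
    T_src_to_tgt mapping source-camera coordinates to target-camera ones.
    Returns an array of shape (num_samples, 2) with (u, v) coordinates.
    """
    # Candidate depths around the estimate, kept in front of the camera.
    depths = np.linspace(max(depth - delta, 1e-6), depth + delta, num_samples)

    # Back-project the pixel to a ray with unit z in the source camera frame.
    ray = np.linalg.inv(K_src) @ np.array([uv[0], uv[1], 1.0])

    # 3D points along the ray at each candidate depth, shape (S, 3).
    pts_src = depths[:, None] * ray[None, :]

    # Move the points into the target camera frame.
    R, t = T_src_to_tgt[:3, :3], T_src_to_tgt[:3, 3]
    pts_tgt = pts_src @ R.T + t

    # Perspective projection into the target image.
    proj = pts_tgt @ K_tgt.T
    return proj[:, :2] / proj[:, 2:3]
```

A wider `delta` makes the sampled segment longer, which is one simple way to tolerate a less reliable depth estimate; how the actual method sets this range is not stated in the text above.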
Taxonomy
Topics: Advanced Vision and Imaging · Advanced Image and Video Retrieval Techniques · Computer Graphics and Visualization Techniques
Methods: Softmax · Attention Is All You Need · Concatenated Skip Connection · Max Pooling · Convolution · U-Net · Diffusion · Focus
