Pixel-Aligned Multi-View Generation with Depth Guided Decoder

Zhenggang Tang; Peiye Zhuang; Chaoyang Wang; Aliaksandr Siarohin; Yash Kant; Alexander Schwing; Sergey Tulyakov; Hsin-Ying Lee

arXiv:2408.14016·cs.CV·August 27, 2024

TL;DR

This paper introduces a novel pixel-aligned multi-view generation method that incorporates depth-guided attention in the VAE decoder, improving multi-view consistency and aiding 3D reconstruction from single images.

Contribution

It proposes a depth-truncated epipolar attention mechanism within a latent diffusion framework, enhancing pixel alignment across views and robustness to inaccurate depth during inference.

Findings

1. Improved pixel alignment across multi-view images.

2. Enhanced performance in multi-view-to-3D reconstruction.

3. Effective handling of inaccurate depth during inference.

Abstract

The task of image-to-multi-view generation refers to generating novel views of an instance from a single image. Recent methods achieve this by extending text-to-image latent diffusion models, which contain a VAE image encoder and a U-Net diffusion model, to a multi-view version. Specifically, these generation methods usually keep the VAE fixed and finetune only the U-Net. However, the significant downscaling of the latent vectors computed from the input images, combined with independent decoding, leads to notable pixel-level misalignment across multiple views. To address this, we propose a novel method for pixel-level image-to-multi-view generation. Unlike prior work, we incorporate attention layers across multi-view images in the VAE decoder of a latent video diffusion model. Specifically, we introduce a depth-truncated epipolar attention, enabling the model to focus on spatially adjacent regions while…
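The idea of depth-truncated epipolar attention can be illustrated with a small geometric sketch. This is not the authors' implementation: the function names, the relative depth window `rel_range`, the sample count, and the nearest-neighbor feature lookup are all assumptions made for illustration. The sketch takes one source pixel with an estimated depth, samples 3D points along its ray only within a window around that depth, projects them into a second view (where they fall on a short segment of the epipolar line), and attends over the features found there.

```python
import numpy as np

def truncated_epipolar_samples(uv, depth, K, R, t, rel_range=0.1, n_samples=8):
    """Project a depth-truncated segment of a source pixel's ray into a target view.

    uv: (u, v) pixel in the source view; depth: estimated depth at that pixel.
    K: shared 3x3 intrinsics; R, t: source-to-target rotation and translation.
    rel_range and n_samples are illustrative assumptions, not values from the paper.
    """
    # Back-project the pixel to a ray with unit z in the source camera frame.
    ray = np.linalg.inv(K) @ np.array([uv[0], uv[1], 1.0])
    # Sample depths only in a window around the estimate (the "truncation").
    depths = np.linspace(depth * (1 - rel_range), depth * (1 + rel_range), n_samples)
    pts_src = depths[:, None] * ray[None, :]      # (n, 3) points in the source frame
    pts_tgt = pts_src @ R.T + t                   # (n, 3) points in the target frame
    proj = pts_tgt @ K.T                          # homogeneous pixel coordinates
    return proj[:, :2] / proj[:, 2:3]             # (n, 2) pixels on the epipolar line

def attend(query, target_feats, pix):
    """Softmax attention of one query vector over target features at sampled pixels."""
    h, w, c = target_feats.shape
    # Nearest-neighbor lookup, clamped to the image bounds (a simplification).
    cols = np.clip(np.rint(pix[:, 0]).astype(int), 0, w - 1)
    rows = np.clip(np.rint(pix[:, 1]).astype(int), 0, h - 1)
    keys = target_feats[rows, cols]               # (n, c)
    logits = keys @ query / np.sqrt(c)
    weights = np.exp(logits - logits.max())       # numerically stable softmax
    weights /= weights.sum()
    return weights @ keys, weights
```

Restricting the samples to a window around the estimated depth keeps the attention on spatially adjacent regions of the other view, and a wider window would be one plausible way to tolerate less accurate depth, at the cost of more candidates per pixel.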

Taxonomy

Topics: Advanced Vision and Imaging · Advanced Image and Video Retrieval Techniques · Computer Graphics and Visualization Techniques

Methods: Softmax · Attention Is All You Need · Concatenated Skip Connection · Max Pooling · Convolution · U-Net · Diffusion · Focus