Fixing the Perspective: A Critical Examination of Zero-1-to-3
Jack Yu, Xueying Jia, Charlie Sun, Prince Wang

TL;DR
This paper critically examines Zero-1-to-3's approach to novel view synthesis, identifying implementation issues in its cross-attention mechanism and proposing corrections and enhancements to improve view consistency and accuracy.
Contribution
The paper uncovers a discrepancy between Zero-1-to-3's theoretical framework and its implementation, and introduces corrected and improved architectures for better multi-view synthesis.
Findings
Identified a critical discrepancy in Zero-1-to-3's cross-attention processing.
Proposed a corrected implementation for more effective use of cross-attention.
Preliminary results indicate potential improvements in view synthesis quality.
Abstract
Novel view synthesis is a fundamental challenge in image-to-3D generation, requiring the generation of target view images from a set of conditioning images and their relative poses. While recent approaches like Zero-1-to-3 have demonstrated promising results using conditional latent diffusion models, they face significant challenges in generating consistent and accurate novel views, particularly when handling multiple conditioning images. In this work, we conduct a thorough investigation of Zero-1-to-3's cross-attention mechanism within the Spatial Transformer of the diffusion 2D-conditional UNet. Our analysis reveals a critical discrepancy between Zero-1-to-3's theoretical framework and its implementation, specifically in the processing of image-conditional context. We propose two significant improvements: (1) a corrected implementation that enables effective utilization of the…
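For readers unfamiliar with the mechanism the abstract refers to, the following is a minimal, generic sketch of single-head cross-attention, in which queries come from the latent image tokens and keys and values come from the conditioning context. It is not the Zero-1-to-3 code and does not reproduce the discrepancy the paper reports. The function names, the toy inputs, and the identity projection matrices are all assumptions made for illustration.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    """Swap rows and columns of a matrix given as a list of rows."""
    return [list(col) for col in zip(*m)]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(latent_tokens, context_tokens, w_q, w_k, w_v):
    """Generic single-head cross-attention (illustrative, hypothetical helper).

    Returns the attended output and the attention weights.
    """
    q = matmul(latent_tokens, w_q)   # queries from the latent features
    k = matmul(context_tokens, w_k)  # keys from the conditioning context
    v = matmul(context_tokens, w_v)  # values from the conditioning context
    scale = 1.0 / math.sqrt(len(q[0]))
    scores = [[s * scale for s in row] for row in matmul(q, transpose(k))]
    weights = [softmax(row) for row in scores]  # one distribution per query
    return matmul(weights, v), weights


# Toy inputs (assumed values): 3 latent tokens and 2 context tokens, dimension 2.
latent = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
context = [[1.0, 2.0], [0.5, -1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]  # stand-in for learned projections

out, attn = cross_attention(latent, context, identity, identity, identity)
```

Each row of `attn` is a probability distribution over the context tokens, so the output has one row per latent token and each row is a weighted average of the context values.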
Taxonomy
Topics: Parallel Computing and Optimization Techniques
Methods: Dense Connections · Label Smoothing · Dropout · Linear Layer · Layer Normalization · Byte Pair Encoding · Adam · Residual Connection · Spatial Transformer · Softmax
