TOSS: High-quality Text-guided Novel View Synthesis from a Single Image
Yukai Shi, Jianan Wang, He Cao, Boshi Tang, Xianbiao Qi, Tianyu Yang, Yukun Huang, Shilong Liu, Lei Zhang, Heung-Yeung Shum

TL;DR
TOSS uses text guidance and a fine-tuned diffusion model to synthesize high-quality, controllable, and multiview-consistent novel views from a single image, addressing limitations of earlier methods that treat the task as pure image-to-image translation.
Contribution
The paper introduces TOSS, a framework that integrates text semantics with diffusion models to improve single-image novel view synthesis (NVS), adding explicit user control and better detail preservation.
Findings
TOSS outperforms Zero-1-to-3 in plausibility and multiview consistency.
Text guidance enhances control over generated views.
Architectural changes and dedicated training improve pose accuracy and detail retention.
Abstract
In this paper, we present TOSS, which introduces text to the task of novel view synthesis (NVS) from just a single RGB image. While Zero-1-to-3 has demonstrated impressive zero-shot open-set NVS capability, it treats NVS as a pure image-to-image translation problem. This approach suffers from the challengingly under-constrained nature of single-view NVS: the process lacks means of explicit user control and often results in implausible NVS generations. To address this limitation, TOSS uses text as high-level semantic information to constrain the NVS solution space. TOSS fine-tunes text-to-image Stable Diffusion pre-trained on large-scale text-image pairs and introduces modules specifically tailored to image and camera pose conditioning, as well as dedicated training for pose correctness and preservation of fine details. Comprehensive experiments are conducted with results showing that…
Taxonomy
Topics: Advanced Vision and Imaging · Advanced Image and Video Retrieval Techniques · Generative Adversarial Networks and Image Synthesis
Methods: Diffusion
