TOSS: High-quality Text-guided Novel View Synthesis from a Single Image

Yukai Shi; Jianan Wang; He Cao; Boshi Tang; Xianbiao Qi; Tianyu Yang; Yukun Huang; Shilong Liu; Lei Zhang; Heung-Yeung Shum

arXiv:2310.10644 · cs.CV · October 17, 2023 · 2 cites

Open Access

TL;DR

TOSS leverages text guidance and fine-tuned diffusion models to produce high-quality, controllable, and multiview-consistent novel view synthesis from a single image, addressing limitations of previous image-to-image translation methods.

Contribution

The paper introduces TOSS, a novel framework that integrates text semantics with diffusion models for improved single-image NVS with explicit control and detail preservation.

Findings

01

TOSS outperforms Zero-1-to-3 in plausibility and multiview consistency.

02

Text guidance enhances control over generated views.

03

Modules tailored to image and camera pose conditioning, together with dedicated training, improve pose accuracy and detail retention.

Abstract

In this paper, we present TOSS, which introduces text to the task of novel view synthesis (NVS) from just a single RGB image. While Zero-1-to-3 has demonstrated impressive zero-shot open-set NVS capability, it treats NVS as a pure image-to-image translation problem. This approach suffers from the challengingly under-constrained nature of single-view NVS: the process lacks means of explicit user control and often results in implausible NVS generations. To address this limitation, TOSS uses text as high-level semantic information to constrain the NVS solution space. TOSS fine-tunes text-to-image Stable Diffusion pre-trained on large-scale text-image pairs and introduces modules specifically tailored to image and camera pose conditioning, as well as dedicated training for pose correctness and preservation of fine details. Comprehensive experiments are conducted with results showing that…

Taxonomy

Topics: Advanced Vision and Imaging · Advanced Image and Video Retrieval Techniques · Generative Adversarial Networks and Image Synthesis

Methods: Diffusion