Taming Encoder for Zero Fine-tuning Image Customization with Text-to-Image Diffusion Models

Xuhui Jia; Yang Zhao; Kelvin C.K. Chan; Yandong Li; Han Zhang; Boqing Gong; Tingbo Hou; Huisheng Wang; Yu-Chuan Su

arXiv:2304.02642 · cs.CV · April 6, 2023 · 21 cites

Open Access

TL;DR

This paper introduces a fast, encoder-based framework for zero-shot image customization that generates high-quality, diverse images of user-specified objects without test-time optimization, leveraging object-specific embeddings.

Contribution

It presents a novel encoder-based approach with a regularized joint training scheme and caption generation to faithfully incorporate object identity into text-to-image models.

Findings

01 Produces high-quality, diverse images of customized objects

02 Eliminates the need for test-time optimization

03 Maintains object fidelity and appearance diversity

Abstract

This paper proposes a method for generating images of customized objects specified by users. The method is based on a general framework that bypasses the lengthy optimization required by previous approaches, which often employ a per-object optimization paradigm. Our framework adopts an encoder to capture high-level identifiable semantics of objects, producing an object-specific embedding with only a single feed-forward pass. The acquired object embedding is then passed to a text-to-image synthesis model for subsequent generation. To effectively blend an object-aware embedding space into a well-developed text-to-image model under the same generation context, we investigate different network designs and training strategies, and propose a simple yet effective regularized joint training scheme with an object identity preservation loss. Additionally, we propose a caption generation scheme…
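
The data flow described in the abstract (an encoder produces an object embedding in one feed-forward pass, and that embedding is handed to the text-to-image model) can be illustrated with a toy sketch. Everything here is a hypothetical stand-in rather than the paper's implementation: the "encoder" is a single random linear map, and appending the object embedding as one extra conditioning token next to the text tokens is an assumption, since the abstract does not say how the embedding is injected. The names `encode_object` and `build_conditioning` are invented for illustration.

```python
import random

EMBED_DIM = 4    # hypothetical embedding width
NUM_PIXELS = 6   # hypothetical flattened image size


def encode_object(image, weights):
    """Toy encoder: one feed-forward pass (a single linear map)
    from flattened pixels to an object embedding. No per-object
    optimization loop is involved."""
    return [sum(w * p for w, p in zip(row, image)) for row in weights]


def build_conditioning(text_token_embeddings, object_embedding):
    """Assumed injection scheme: append the object embedding as one
    extra conditioning token after the text tokens. Returns a new
    list and leaves the text tokens untouched."""
    return text_token_embeddings + [object_embedding]


random.seed(0)
# Fixed (already "trained") encoder weights, random here for illustration.
weights = [[random.uniform(-1.0, 1.0) for _ in range(NUM_PIXELS)]
           for _ in range(EMBED_DIM)]

image = [0.1, 0.5, 0.9, 0.3, 0.7, 0.2]                 # stand-in for a user photo
text_tokens = [[0.0] * EMBED_DIM for _ in range(3)]    # stand-in for a 3-token prompt

object_embedding = encode_object(image, weights)
conditioning = build_conditioning(text_tokens, object_embedding)
# `conditioning` is what a (hypothetical) text-to-image model would consume.
```

The point of the sketch is only the contrast with per-object optimization: a new object costs one call to `encode_object`, not a fine-tuning run.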

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Computer Graphics and Visualization Techniques · Video Analysis and Summarization