Taming Encoder for Zero Fine-tuning Image Customization with Text-to-Image Diffusion Models
Xuhui Jia, Yang Zhao, Kelvin C.K. Chan, Yandong Li, Han Zhang, Boqing Gong, Tingbo Hou, Huisheng Wang, Yu-Chuan Su

TL;DR
This paper introduces a fast, encoder-based framework for zero-shot image customization that generates high-quality, diverse images of user-specified objects without test-time optimization, leveraging object-specific embeddings.
Contribution
It presents a novel encoder-based approach with a regularized joint training scheme and caption generation to faithfully incorporate object identity into text-to-image models.
Findings
Produces high-quality, diverse images of customized objects
Eliminates the need for test-time optimization
Maintains object fidelity and appearance diversity
Abstract
This paper proposes a method for generating images of customized objects specified by users. The method is based on a general framework that bypasses the lengthy optimization required by previous approaches, which often employ a per-object optimization paradigm. Our framework adopts an encoder to capture high-level identifiable semantics of objects, producing an object-specific embedding with only a single feed-forward pass. The acquired object embedding is then passed to a text-to-image synthesis model for subsequent generation. To effectively blend an object-aware embedding space into a well-developed text-to-image model under the same generation context, we investigate different network designs and training strategies, and propose a simple yet effective regularized joint training scheme with an object identity preservation loss. Additionally, we propose a caption generation scheme…
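The data flow described in the abstract (an encoder produces an object embedding in one feed-forward pass, the embedding conditions the text-to-image model, and training adds an identity preservation term) can be illustrated with a toy sketch. Everything here is an assumption made for illustration and is not taken from the paper: the names `encode_object`, `build_conditioning`, and `joint_loss`, the random linear layer standing in for a trained encoder, the choice to append the embedding as one extra token, and the weighted-sum form of the loss with weight `lam`.

```python
import random

EMBED_DIM = 8  # hypothetical width shared by text tokens and the object embedding


def make_linear(in_dim, out_dim, rng):
    """Random weight matrix standing in for a trained encoder (assumption)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]


def encode_object(image_features, weights):
    """Single feed-forward pass: image features -> one object embedding."""
    return [sum(w * x for w, x in zip(row, image_features)) for row in weights]


def build_conditioning(text_token_embeddings, object_embedding):
    """Append the object embedding to the text tokens as one extra token.

    How the paper actually injects the embedding is not stated in the
    abstract; appending a token is only one plausible option.
    """
    return text_token_embeddings + [object_embedding]


def joint_loss(denoise_loss, identity_loss, lam=0.1):
    """Hypothetical regularized objective: denoising term plus a weighted
    identity preservation term. The weighted-sum form is an assumption."""
    return denoise_loss + lam * identity_loss


rng = random.Random(0)
weights = make_linear(in_dim=16, out_dim=EMBED_DIM, rng=rng)
image_features = [rng.random() for _ in range(16)]  # stand-in for image features
text_tokens = [[rng.random() for _ in range(EMBED_DIM)] for _ in range(4)]

# No per-object optimization loop: the embedding comes from one forward pass.
object_embedding = encode_object(image_features, weights)
conditioning = build_conditioning(text_tokens, object_embedding)
```

The point of the sketch is the absence of any test-time optimization: a new object only requires calling `encode_object` once, after which the conditioning sequence is ready for the generator.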
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Computer Graphics and Visualization Techniques · Video Analysis and Summarization
