Conditional Text-to-Image Generation with Reference Guidance

Taewook Kim; Ze Wang; Zhengyuan Yang; Jiang Wang; Lijuan Wang; Zicheng Liu; Qiang Qiu

arXiv:2411.16713 · cs.CV · December 15, 2025

TL;DR

This paper introduces reference-guided conditioning for text-to-image diffusion models, improving the rendering of specific subjects like text and extending capabilities to multilingual and logo generation.

Contribution

It proposes expert plugins that incorporate visual reference conditions into diffusion models, enhancing accuracy and generalization for specialized text and image synthesis tasks.

Findings

01. Superior results on all tasks compared to existing methods

02. Efficient plugins with only 28.55M parameters

03. Extended capabilities to multilingual and non-English text generation

Abstract

Text-to-image diffusion models have demonstrated tremendous success in synthesizing visually stunning images given textual instructions. Despite remarkable progress in creating high-fidelity visuals, text-to-image models can still struggle with precisely rendering subjects, such as text spelling. To address this challenge, this paper explores using additional conditions of an image that provides visual guidance of the particular subjects for diffusion models to generate. In addition, this reference condition empowers the model to be conditioned in ways that the vocabularies of the text tokenizer cannot adequately represent, and further extends the model's generalization to novel capabilities such as generating non-English text spellings. We develop several small-scale expert plugins that efficiently endow a Stable Diffusion model with the capability to take different references. Each…

Taxonomy

Topics: Video Analysis and Summarization · Image Retrieval and Classification Techniques · Multimodal Machine Learning Applications