End-to-end Training for Text-to-Image Synthesis using Dual-Text Embeddings

Yeruru Asrar Ahmed; Anurag Mittal

arXiv:2502.01507·cs.CV·February 4, 2025

TL;DR

This paper introduces an end-to-end training approach for text-to-image synthesis that learns two specialized text embeddings, improving image realism and alignment, and demonstrating versatility in manipulation tasks.

Contribution

It proposes a novel end-to-end learning framework with dual embeddings tailored for T2I synthesis, outperforming methods using pre-trained generic text embeddings.

Findings

01. Dual embeddings outperform shared ones in T2I tasks

02. End-to-end training improves image realism and alignment

03. Embeddings are effective for text-to-image manipulation

Abstract

Text-to-Image (T2I) synthesis is a challenging task that requires modeling complex interactions between two modalities (i.e., text and image). A common framework adopted in recent state-of-the-art approaches to achieving such multimodal interactions is to bootstrap the learning process with pre-trained image-aligned text embeddings trained using contrastive loss. Furthermore, these embeddings are typically trained generically and reused across various synthesis models. In contrast, we explore an approach to learning text embeddings specifically tailored to the T2I synthesis network, trained in an end-to-end fashion. Further, we combine generative and contrastive training and use two embeddings, one optimized to enhance the photo-realism of the generated images, and the other seeking to capture text-to-image alignment. A comprehensive set of experiments on three text-to-image benchmark…
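
The idea of two text embeddings trained jointly with a generative and a contrastive objective can be illustrated with a toy sketch. Everything below is an assumption made for illustration: the abstract does not specify the encoder architecture, the exact losses, or how they are weighted. The names `DualTextEncoder`, `contrastive_loss`, and `total_loss` are hypothetical, the contrastive term is a symmetric InfoNCE-style loss (one common choice, not necessarily the authors'), and the generative loss is passed in as a plain number standing in for whatever the synthesis network would produce.

```python
import math
import random


def l2_normalize(vec):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def diag_nll(logits):
    """Mean negative log-softmax of the diagonal (matched-pair) entries."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def contrastive_loss(text_embs, image_embs, temperature=0.1):
    """Symmetric InfoNCE-style loss; pair i of text and image is the positive."""
    t = [l2_normalize(v) for v in text_embs]
    im = [l2_normalize(v) for v in image_embs]
    logits = [[dot(a, b) / temperature for b in im] for a in t]
    logits_t = [list(col) for col in zip(*logits)]  # image-to-text direction
    return 0.5 * (diag_nll(logits) + diag_nll(logits_t))


class DualTextEncoder:
    """Hypothetical toy encoder: two separate embedding tables, mean-pooled."""

    def __init__(self, vocab_size, dim, seed=0):
        rng = random.Random(seed)

        def table():
            return [[rng.gauss(0.0, 1.0) for _ in range(dim)]
                    for _ in range(vocab_size)]

        self.realism_table = table()    # would condition the generator
        self.alignment_table = table()  # would feed the contrastive term

    def encode(self, token_ids):
        def pool(tbl):
            rows = [tbl[t] for t in token_ids]
            return [sum(col) / len(rows) for col in zip(*rows)]

        return pool(self.realism_table), pool(self.alignment_table)


def total_loss(generative_loss, align_text_embs, image_embs, weight=1.0):
    """Assumed combination: generative term plus weighted contrastive term."""
    return generative_loss + weight * contrastive_loss(align_text_embs, image_embs)
```

In an actual end-to-end setup both tables would receive gradients from the combined objective; this sketch only computes the loss values and performs no optimization.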

Taxonomy

Topics: Handwritten Text Recognition Techniques · Video Analysis and Summarization · Image Retrieval and Classification Techniques