End-to-end Training for Text-to-Image Synthesis using Dual-Text Embeddings
Yeruru Asrar Ahmed, Anurag Mittal

TL;DR
This paper introduces an end-to-end training approach for text-to-image synthesis that learns two specialized text embeddings, one optimized for photo-realism and one for text-to-image alignment, and shows that the learned embeddings also work for text-to-image manipulation.
Contribution
It proposes a novel end-to-end learning framework with dual embeddings tailored for T2I synthesis, outperforming methods using pre-trained generic text embeddings.
Findings
Dual embeddings outperform shared ones in T2I tasks
End-to-end training improves image realism and alignment
Embeddings are effective for text-to-image manipulation
Abstract
Text-to-Image (T2I) synthesis is a challenging task that requires modeling complex interactions between two modalities (i.e., text and image). A common framework adopted in recent state-of-the-art approaches to achieving such multimodal interactions is to bootstrap the learning process with pre-trained image-aligned text embeddings trained using contrastive loss. Furthermore, these embeddings are typically trained generically and reused across various synthesis models. In contrast, we explore an approach to learning text embeddings specifically tailored to the T2I synthesis network, trained in an end-to-end fashion. Further, we combine generative and contrastive training and use two embeddings, one optimized to enhance the photo-realism of the generated images, and the other seeking to capture text-to-image alignment. A comprehensive set of experiments on three text-to-image benchmark…
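The dual-embedding idea described in the abstract can be illustrated with a rough sketch. The code below is not the authors' implementation: the vocabulary size, embedding dimension, mean pooling, temperature, loss weight, and every function name are assumptions made for illustration, and the generative loss is a placeholder number because this summary does not specify the synthesis network. It only shows two separately learned text embedding tables, a symmetric contrastive (InfoNCE-style) alignment loss, and a weighted sum of a generative term and a contrastive term.

```python
import math
import random


def make_embedding_table(vocab_size, dim, rng):
    """Randomly initialised token-embedding table (hypothetical sizes)."""
    return [[rng.gauss(0.0, 0.1) for _ in range(dim)] for _ in range(vocab_size)]


def embed_caption(table, token_ids):
    """Mean-pool token vectors into one caption vector (an assumed pooling choice)."""
    dim = len(table[0])
    pooled = [0.0] * dim
    for t in token_ids:
        for j in range(dim):
            pooled[j] += table[t][j]
    return [v / len(token_ids) for v in pooled]


def l2_normalize(vec):
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0  # guard against zero vectors
    return [v / norm for v in vec]


def contrastive_loss(text_vecs, image_vecs, temperature=0.1):
    """Symmetric InfoNCE-style loss; pair i of text_vecs matches pair i of image_vecs."""
    t = [l2_normalize(v) for v in text_vecs]
    im = [l2_normalize(v) for v in image_vecs]
    n = len(t)
    logits = [
        [sum(a * b for a, b in zip(t[i], im[j])) / temperature for j in range(n)]
        for i in range(n)
    ]

    def mean_cross_entropy(rows):
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)  # subtract the max for a stable log-sum-exp
            lse = m + math.log(sum(math.exp(x - m) for x in row))
            total += lse - row[i]
        return total / n

    columns = [list(c) for c in zip(*logits)]
    # Average the text-to-image and image-to-text directions.
    return 0.5 * (mean_cross_entropy(logits) + mean_cross_entropy(columns))


def total_loss(generative_loss, alignment_loss, alignment_weight=1.0):
    """Weighted sum of the two training signals (the weighting is an assumption)."""
    return generative_loss + alignment_weight * alignment_loss


rng = random.Random(0)
# Two separate tables: one would condition the generator (realism),
# the other would feed the contrastive alignment objective.
realism_table = make_embedding_table(vocab_size=50, dim=8, rng=rng)
alignment_table = make_embedding_table(vocab_size=50, dim=8, rng=rng)

captions = [[1, 4, 7], [2, 9], [3, 3, 5, 8]]
realism_vecs = [embed_caption(realism_table, c) for c in captions]
alignment_vecs = [embed_caption(alignment_table, c) for c in captions]

# Stand-in image features; a real system would take these from an image encoder.
image_vecs = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in captions]

align = contrastive_loss(alignment_vecs, image_vecs)
loss = total_loss(generative_loss=1.0, alignment_loss=align, alignment_weight=0.5)
```

In an end-to-end setup, gradients from the combined loss would update both tables together with the synthesis network, which is the contrast the abstract draws with frozen, generically pre-trained text embeddings.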
Taxonomy
Topics: Handwritten Text Recognition Techniques · Video Analysis and Summarization · Image Retrieval and Classification Techniques
