Text-to-Image Generation with Attention Based Recurrent Neural Networks
Tehseen Zia, Shahan Arif, Shakeeb Murtaza, and Mirza Ahsan Ullah

TL;DR
This paper introduces a stable, attention-based recurrent neural network model for text-to-image generation that outperforms previous methods on the Microsoft COCO and MNIST-with-captions datasets by capturing word-to-pixel dependencies.
Contribution
The authors propose a novel attention-based encoder and autoregressive decoder for stable, high-quality caption-based image generation, addressing limitations of prior latent variable and GAN models.
Findings
Outperforms existing approaches on the Microsoft COCO and MNIST-with-captions datasets
Generates higher-quality images, as measured by the Structural Similarity Index (SSIM)
Demonstrates a stable training process with an attention-based architecture
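For reference, the Structural Similarity Index named in the findings can be sketched as follows. This is a simplified, single-window (global) variant with the commonly used constants K1 = 0.01 and K2 = 0.03. The standard metric averages the same expression over local windows, and the evaluation settings used in the paper are not stated on this page. The function name `global_ssim` and its parameters are illustrative assumptions.

```python
import numpy as np

def global_ssim(x, y, data_range=1.0, k1=0.01, k2=0.03):
    """Single-window SSIM between two same-shaped images (illustrative sketch).

    Assumes pixel values span `data_range` (1.0 for images scaled to [0, 1]).
    """
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    if x.shape != y.shape:
        raise ValueError("images must have the same shape")

    # Small constants that stabilise the ratio when means or variances are near zero.
    c1 = (k1 * data_range) ** 2
    c2 = (k2 * data_range) ** 2

    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()

    numerator = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator
```

An image compared with itself scores 1, and the score falls as the generated image diverges from the reference in luminance, contrast, or structure.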
Abstract
Conditional image modeling based on textual descriptions is a relatively new domain in unsupervised learning. Previous approaches use latent variable models and generative adversarial networks. While the former are approximated using variational auto-encoders and rely on intractable inference that can hamper their performance, the latter are unstable to train because of their Nash-equilibrium-based objective function. We develop a tractable and stable caption-based image generation model. The model uses an attention-based encoder to learn word-to-pixel dependencies. A conditional autoregressive decoder is used for learning pixel-to-pixel dependencies and generating images. Experiments are performed on the Microsoft COCO and MNIST-with-captions datasets, and performance is evaluated using the Structural Similarity Index. Results show that the proposed model performs better than…
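As a rough illustration of how an attention mechanism can relate caption words to the pixel currently being generated, the sketch below applies generic scaled dot-product attention over per-word encodings. This is an assumption made for illustration, not the authors' architecture: the abstract does not specify the attention form, and the names `attend`, `word_states`, and `query` are hypothetical.

```python
import numpy as np

def attend(word_states, query):
    """Generic scaled dot-product attention (illustrative, not the paper's model).

    word_states: (num_words, dim) array, one encoding per caption word.
    query:       (dim,) array, e.g. a decoder state for the current pixel.
    Returns the context vector (dim,) and the attention weights (num_words,).
    """
    word_states = np.asarray(word_states, dtype=np.float64)
    query = np.asarray(query, dtype=np.float64)

    # Similarity of each word to the query, scaled by sqrt(dim).
    scores = word_states @ query / np.sqrt(word_states.shape[1])

    # Numerically stable softmax over the words.
    scores = scores - scores.max()
    weights = np.exp(scores)
    weights = weights / weights.sum()

    # Weighted sum of word encodings: the caption summary for this pixel.
    context = weights @ word_states
    return context, weights
```

In an autoregressive decoder, a context vector of this kind would condition the distribution over the next pixel alongside the previously generated pixels, so that each pixel depends on both the caption and the pixels before it.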
Taxonomy
Topics: Video Analysis and Summarization · Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications
