TL;DR
LatteGAN is a novel model that improves multi-turn text-guided image manipulation by using a visually guided language attention module and a text-conditioned discriminator, achieving state-of-the-art results.
Contribution
The paper introduces LatteGAN, a new architecture with a visually guided language attention module and a text-conditioned discriminator for enhanced multi-turn image manipulation.
Findings
Achieves state-of-the-art performance on CoDraw and i-CLEVR datasets.
Targets two weaknesses of earlier multi-turn models: under-generation and low visual quality of the objects described in the instructions.
Reports significant improvement over previous multi-turn image manipulation models on both datasets.
Abstract
Text-guided image manipulation tasks have recently gained attention in the vision-and-language community. While most of the prior studies focused on single-turn manipulation, our goal in this paper is to address the more challenging multi-turn image manipulation (MTIM) task. Previous models for this task successfully generate images iteratively, given a sequence of instructions and a previously generated image. However, this approach suffers from under-generation and a lack of generated quality of the objects that are described in the instructions, which consequently degrades the overall performance. To overcome these problems, we present a novel architecture called a Visually Guided Language Attention GAN (LatteGAN). Here, we address the limitations of the previous approaches by introducing a Visually Guided Language Attention (Latte) module, which extracts fine-grained text…
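The abstract names a Visually Guided Language Attention (Latte) module but the excerpt cuts off before describing it. As a rough illustration only, the sketch below assumes the module behaves like scaled dot-product cross-attention in which image feature locations act as queries and instruction word embeddings act as keys and values. The function name `visually_guided_attention`, the toy vectors, and the absence of learned projections are all assumptions made for this example, not details taken from the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def visually_guided_attention(image_feats, word_embs):
    """Hypothetical sketch: each image location attends over instruction words.

    image_feats: list of N vectors, one per spatial location (queries)
    word_embs:   list of T vectors, one per instruction token (keys and values)
    Returns (contexts, weights): N text-context vectors and N x T attention weights.
    No learned projections are used here; a real module would likely have them.
    """
    dim = len(word_embs[0])
    scale = math.sqrt(dim)
    contexts, weights = [], []
    for query in image_feats:
        # Similarity between this image location and every word.
        scores = [sum(q * w for q, w in zip(query, word)) / scale
                  for word in word_embs]
        attn = softmax(scores)
        # Weighted sum of word embeddings gives a location-specific text feature.
        context = [sum(attn[t] * word_embs[t][d] for t in range(len(word_embs)))
                   for d in range(dim)]
        contexts.append(context)
        weights.append(attn)
    return contexts, weights


# Toy inputs: two image locations, three instruction tokens, 2-d features.
image_feats = [[1.0, 0.0], [0.0, 1.0]]
word_embs = [[4.0, 0.0], [0.0, 4.0], [0.0, 0.0]]
contexts, weights = visually_guided_attention(image_feats, word_embs)
```

In this toy setup each image location ends up weighting the word whose embedding points in the same direction most heavily, which is the sense in which the text representation is "visually guided": different regions of the image read different parts of the instruction.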
Taxonomy
Methods: Max Pooling · Concatenated Skip Connection · Convolution · U-Net
