TIME: Text and Image Mutual-Translation Adversarial Networks
Bingchen Liu, Kunpeng Song, Yizhe Zhu, Gerard de Melo, Ahmed Elgammal

TL;DR
TIME introduces a lightweight adversarial network that translates between text and images in both directions. It improves text-to-image generation without pre-training and achieves state-of-the-art results on the CUB dataset.
Contribution
The paper presents a novel mutual-translation adversarial framework that jointly trains a T2I generator and an image-captioning discriminator. It uses Transformers to connect image features with word embeddings and an annealing conditional hinge loss to balance adversarial learning. It needs no extra modules and no pre-training.
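This summary does not give the exact form of the annealing conditional hinge loss. The following is a minimal sketch of a standard conditional hinge GAN loss in plain Python, computed over lists of discriminator scores. Several choices here are hypothetical illustrations and not the authors' formulation: the linear schedule (`linear_anneal`), the decision to anneal the mismatched-caption term, and all function names.

```python
def _mean(xs):
    """Arithmetic mean of a non-empty list of floats."""
    return sum(xs) / len(xs)


def linear_anneal(step, total_steps, start=1.0, end=0.0):
    """Hypothetical schedule: move linearly from `start` to `end`, then hold at `end`."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    frac = min(step / total_steps, 1.0)
    return start + (end - start) * frac


def d_hinge_loss(real_scores, fake_scores, mismatch_scores, cond_weight):
    """Discriminator hinge loss over raw discriminator scores.

    real_scores:     D(real image, matching caption), pushed above +1
    fake_scores:     D(generated image, its caption), pushed below -1
    mismatch_scores: D(real image, mismatched caption), pushed below -1
                     and scaled by the (hypothetical) annealed cond_weight
    """
    real = _mean([max(0.0, 1.0 - s) for s in real_scores])
    fake = _mean([max(0.0, 1.0 + s) for s in fake_scores])
    mismatch = _mean([max(0.0, 1.0 + s) for s in mismatch_scores])
    return real + fake + cond_weight * mismatch


def g_hinge_loss(fake_scores):
    """Generator hinge loss: push D's scores on generated images upward."""
    return -_mean(fake_scores)
```

In a training loop, you would recompute `cond_weight = linear_anneal(step, total_steps)` at each iteration. This gradually changes how strongly mismatched text-image pairs are penalized relative to the real/fake terms.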
Findings
Achieves SOTA performance on the CUB dataset with an Inception Score (IS) of 4.91
Attains a Fréchet Inception Distance (FID) of 14.3 on CUB
Shows promising results on MS-COCO for captioning and vision-language tasks
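Here is how the two metrics are defined, for reference:

- IS = exp(E_x[KL(p(y|x) || p(y))]) is computed from a classifier's per-image class probabilities. Higher is better.
- FID = ||mu_r - mu_g||^2 + Tr(Sigma_r + Sigma_g - 2(Sigma_r Sigma_g)^(1/2)) compares Gaussians fitted to real and generated image features. Lower is better.

The sketch below computes IS in plain Python from a list of probability rows. It assumes those rows are already available. Producing them requires an Inception-style classifier, which is outside this sketch.

```python
import math


def inception_score(probs):
    """Inception Score from per-image class probabilities.

    probs: one row per generated image; each row is a probability
    distribution p(y|x) over classes (in practice, classifier softmax outputs).
    """
    n = len(probs)
    num_classes = len(probs[0])
    # Marginal class distribution p(y): average of p(y|x) over all images.
    marginal = [sum(row[j] for row in probs) / n for j in range(num_classes)]
    # Mean KL(p(y|x) || p(y)); terms with p(y|x) = 0 contribute nothing.
    mean_kl = sum(
        sum(p * math.log(p / m) for p, m in zip(row, marginal) if p > 0)
        for row in probs
    ) / n
    return math.exp(mean_kl)
```

Confident and diverse predictions raise the score. Two images classified with certainty into different classes give a score of 2 in exact arithmetic, while identical uniform predictions give 1. Reported values such as 4.91 also depend on the specific classifier and on averaging over data splits, which this sketch omits.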
Abstract
Focusing on text-to-image (T2I) generation, we propose Text and Image Mutual-Translation Adversarial Networks (TIME), a lightweight but effective model that jointly learns a T2I generator G and an image captioning discriminator D under the Generative Adversarial Network framework. While previous methods tackle the T2I problem as a uni-directional task and use pre-trained language models to enforce the image-text consistency, TIME requires neither extra modules nor pre-training. We show that the performance of G can be boosted substantially by training it jointly with D as a language model. Specifically, we adopt Transformers to model the cross-modal connections between the image features and word embeddings, and design an annealing conditional hinge loss that dynamically balances the adversarial learning. In our experiments, TIME achieves state-of-the-art (SOTA) performance on the CUB…
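To make "modeling cross-modal connections" concrete, below is a minimal single-head scaled dot-product cross-attention sketch in plain Python. Each image-region feature acts as a query and attends over the word embeddings, which serve as both keys and values. This is the generic Transformer attention operation, not TIME's actual architecture. The following are simplifying assumptions:

- the attention direction,
- the absence of learned projection matrices and multiple heads, and
- the names `image_regions` and `word_embeddings`.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(image_regions, word_embeddings):
    """Each image region attends over the words; returns one context vector per region.

    Generic scaled dot-product attention with keys = values = word embeddings
    (single head, no learned projections: simplifying assumptions).
    Region features and word embeddings must share the same dimension.
    """
    dim = len(image_regions[0])
    contexts = []
    for query in image_regions:
        # Similarity of this region to every word, scaled by sqrt(dim).
        scores = [
            sum(q * k for q, k in zip(query, word)) / math.sqrt(dim)
            for word in word_embeddings
        ]
        weights = softmax(scores)
        # Attention-weighted sum of the word embeddings.
        contexts.append([
            sum(w * word[j] for w, word in zip(weights, word_embeddings))
            for j in range(len(word_embeddings[0]))
        ])
    return contexts
```

In a full Transformer, the queries, keys, and values would each first pass through learned linear maps. The reverse direction, in which words attend over image regions, is the same operation with the two arguments swapped.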
Taxonomy
Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques
