Text-to-Image GAN with Pretrained Representations
Xiaozhou You, Jian Zhang

TL;DR
TIGER introduces a novel GAN architecture utilizing pretrained vision models and high-capacity fusion blocks to achieve faster, more accurate text-to-image synthesis, outperforming existing methods on standard and zero-shot tasks.
Contribution
The paper proposes TIGER, a text-to-image GAN with a vision-empowered discriminator and a high-capacity generator, enhancing performance and speed over prior models.
Findings
Achieves state-of-the-art FID scores on COCO and CUB datasets.
Demonstrates superior zero-shot synthesis with fewer parameters.
Faster inference compared to diffusion and autoregressive models.
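For readers unfamiliar with the FID metric cited in the findings above, the following is a minimal sketch of the Fréchet distance between two feature sets, where lower values indicate closer distributions. It is not the paper's evaluation code. It assumes diagonal covariances so that it runs with the standard library alone; standard FID uses full covariance matrices over Inception network features. The function name `diagonal_fid` and the toy feature values are hypothetical.

```python
import math
from statistics import fmean, pvariance


def diagonal_fid(real_feats, gen_feats):
    """Frechet distance between two feature sets, assuming diagonal covariances.

    Each argument is a list of equal-length feature vectors (lists of floats).
    Per dimension the distance is (mu_r - mu_g)^2 + var_r + var_g - 2*sqrt(var_r*var_g),
    which is what the full Frechet formula reduces to when both covariances are diagonal.
    """
    dims = len(real_feats[0])
    total = 0.0
    for d in range(dims):
        real_col = [row[d] for row in real_feats]
        gen_col = [row[d] for row in gen_feats]
        mu_r, mu_g = fmean(real_col), fmean(gen_col)
        # Population variance is used here for simplicity (an assumption of this sketch).
        var_r, var_g = pvariance(real_col), pvariance(gen_col)
        total += (mu_r - mu_g) ** 2 + var_r + var_g - 2.0 * math.sqrt(var_r * var_g)
    return total


# Toy features: the generated set is the real set shifted by 1 in each dimension.
real = [[0.0, 0.0], [2.0, 2.0]]
generated = [[1.0, 1.0], [3.0, 3.0]]
score = diagonal_fid(real, generated)
```

In this toy case the variances match, so the score comes only from the difference in means; identical feature sets give a score of zero.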
Abstract
Generating images conditioned on text descriptions has attracted considerable attention. Recently, diffusion models and autoregressive models have demonstrated outstanding expressivity and have gradually replaced GANs as the favored architectures for text-to-image synthesis. However, they still face two obstacles: slow inference and expensive training. To achieve more powerful and faster text-to-image synthesis for complex scenes, we propose TIGER, a text-to-image GAN with pretrained representations. Specifically, we propose a vision-empowered discriminator and a high-capacity generator. (i) The vision-empowered discriminator absorbs the complex scene understanding ability and the domain generalization ability of pretrained vision models to enhance model performance. Unlike previous works, we explore stacking multiple pretrained models in our discriminator to…
Taxonomy
Topics: Image Retrieval and Classification Techniques · Handwritten Text Recognition Techniques · AI in cancer detection
Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Diffusion
