Fine-grained Cross-modal Fusion based Refinement for Text-to-Image Synthesis
Haoran Sun, Yang Wang, Haipeng Liu, Biao Qian

TL;DR
This paper introduces FF-GAN, a novel text-to-image synthesis model that enhances semantic consistency and detail in generated images through fine-grained text-image fusion and global semantic refinement.
Contribution
It proposes a new fusion block and a semantic refinement module to better utilize textual information and improve image quality in text-to-image synthesis.
Findings
Outperforms state-of-the-art methods on CUB-200 and COCO datasets
Produces images with higher semantic consistency and detail
Effective fusion of fine-grained text features into visual generation
Abstract
Text-to-image synthesis refers to generating visually realistic and semantically consistent images from given textual descriptions. Previous approaches generate an initial low-resolution image and then refine it to high resolution. Despite the remarkable progress, these methods do not fully utilize the given texts and can generate text-mismatched images, especially when the text description is complex. We propose a novel Fine-grained text-image Fusion based Generative Adversarial Networks, dubbed FF-GAN, which consists of two modules: Fine-grained text-image Fusion Block (FF-Block) and Global Semantic Refinement (GSR). The proposed FF-Block integrates an attention block and several convolution layers to effectively fuse the fine-grained word-context features into the corresponding visual features, in which the text information is fully used to refine the initial image with…
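The fusion idea in the abstract (attend over word features for each visual feature, then merge the resulting word-context vector into that feature) can be illustrated with a toy sketch. This is not the authors' FF-Block: the function names `softmax`, `word_context`, and `fuse`, the dot-product scoring, the concatenation step, and the toy vectors are all assumptions made for illustration, and the convolution layers mentioned in the abstract are omitted.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def word_context(region, words):
    """Attention-weighted sum of word vectors for one image region.

    Scores are plain dot products between the region feature and each
    word feature (an assumed scoring rule, not taken from the paper).
    """
    scores = [sum(r * w for r, w in zip(region, word)) for word in words]
    weights = softmax(scores)
    dim = len(words[0])
    return [sum(a * word[d] for a, word in zip(weights, words)) for d in range(dim)]


def fuse(regions, words):
    """Concatenate each region feature with its word-context vector.

    A real model would pass the result through learned layers; here the
    fusion is only the concatenation (an assumption for brevity).
    """
    return [region + word_context(region, words) for region in regions]


# Toy inputs: 2 image regions and 3 words, all 2-dimensional (hypothetical values).
regions = [[1.0, 0.0], [0.0, 1.0]]
words = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
fused = fuse(regions, words)
```

Each fused vector keeps the original region feature in its first half and carries the text-derived context in its second half, so a region that aligns with particular words receives a context vector weighted toward those words.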
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques
Methods: Convolution
