Layout-Bridging Text-to-Image Synthesis
Jiadong Liang, Wenjie Pei, Feng Lu

TL;DR
This paper introduces a novel approach for text-to-image synthesis that emphasizes effective layout modeling through Transformer-based text-to-layout generation and layout-to-image synthesis, improving semantic consistency and spatial accuracy.
Contribution
It proposes a new Transformer-based framework for joint text-to-layout and layout-to-image synthesis, along with a novel Layout Quality Score metric for evaluating layout quality.
Findings
Outperforms state-of-the-art methods in layout prediction
Achieves higher image synthesis quality from text descriptions
Demonstrates effective modeling of spatial relationships
Abstract
The crux of text-to-image synthesis stems from the difficulty of preserving the cross-modality semantic consistency between the input text and the synthesized image. Typical methods, which seek to model the text-to-image mapping directly, can only capture keywords in the text that indicate common objects or actions but fail to learn their spatial distribution patterns. An effective way to circumvent this limitation is to generate an image layout as guidance, which is attempted by a few methods. Nevertheless, these methods fail to generate practically effective layouts due to the diversity of input text and object location. In this paper, we push for effective modeling in both text-to-layout generation and layout-to-image synthesis. Specifically, we formulate the text-to-layout generation as a sequence-to-sequence modeling task, and build our model upon Transformer to learn the spatial…
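To make the sequence-to-sequence framing concrete, here is a minimal sketch of one way a layout could be serialized into a token sequence that a Transformer decoder might predict. The abstract does not specify the authors' tokenization, so everything here is an assumption for illustration: the `Box` format, the `NUM_BINS` resolution, the token spelling, and the helper names `quantize`, `dequantize`, `layout_to_tokens`, and `tokens_to_layout` are all hypothetical.

```python
from typing import List, Tuple

# Hypothetical box format: (label, x, y, w, h) with coordinates normalized to [0, 1].
Box = Tuple[str, float, float, float, float]

NUM_BINS = 32  # assumed quantization resolution, not taken from the paper


def quantize(value: float, num_bins: int = NUM_BINS) -> int:
    """Map a normalized coordinate to an integer bin index in [0, num_bins - 1]."""
    clipped = min(max(value, 0.0), 1.0)
    return min(int(clipped * num_bins), num_bins - 1)


def dequantize(index: int, num_bins: int = NUM_BINS) -> float:
    """Map a bin index back to the centre of its bin."""
    return (index + 0.5) / num_bins


def layout_to_tokens(boxes: List[Box]) -> List[str]:
    """Flatten a layout into: <bos>, then (label, x, y, w, h) per object, then <eos>."""
    tokens = ["<bos>"]
    for label, x, y, w, h in boxes:
        tokens.append(label)
        for name, value in zip("xywh", (x, y, w, h)):
            tokens.append(f"<{name}_{quantize(value)}>")
    tokens.append("<eos>")
    return tokens


def tokens_to_layout(tokens: List[str]) -> List[Box]:
    """Invert layout_to_tokens, recovering coordinates up to quantization error."""
    body = tokens[1:-1]  # drop <bos> and <eos>
    boxes: List[Box] = []
    for i in range(0, len(body), 5):
        label = body[i]
        coords = [dequantize(int(tok[1:-1].split("_")[1])) for tok in body[i + 1:i + 5]]
        boxes.append((label, coords[0], coords[1], coords[2], coords[3]))
    return boxes


layout = [("person", 0.25, 0.5, 0.5, 0.4)]
tokens = layout_to_tokens(layout)
recovered = tokens_to_layout(tokens)
```

Under this assumed scheme, a text encoder would consume the caption and a decoder would emit such tokens one at a time, so that each object's position can be conditioned on the objects already placed. The round trip loses at most half a bin width per coordinate.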
Taxonomy
Topics: Handwritten Text Recognition Techniques · Human Motion and Animation · Advanced Image and Video Retrieval Techniques
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Dense Connections · Absolute Position Encodings · Label Smoothing · Position-Wise Feed-Forward Layer · Adam · Layer Normalization · Dropout
