EchoGen: Cycle-Consistent Learning for Unified Layout-Image Generation and Understanding
Kai Zou, Hongbo Liu, Dian Zheng, Jianxiong Gao, Zhiwei Zhao, Bin Liu

TL;DR
EchoGen introduces a unified framework that jointly improves layout-to-image generation and image grounding through progressive training, achieving state-of-the-art results and demonstrating the mutual benefits of integrated task optimization.
Contribution
The paper proposes a progressive training strategy for jointly learning layout-to-image generation and image grounding, addressing the optimization challenges of joint training and improving performance on both tasks.
Findings
Achieves state-of-the-art results on layout-to-image generation benchmarks.
Demonstrates significant improvements in image grounding accuracy.
Shows synergistic benefits from joint training of both tasks.
Abstract
In this work, we present EchoGen, a unified framework for layout-to-image generation and image grounding, capable of generating images with accurate layouts and high fidelity to text descriptions (e.g., spatial relationships), while grounding the image robustly at the same time. We believe that image grounding possesses strong text and layout understanding abilities, which can compensate for the corresponding limitations in layout-to-image generation. At the same time, images generated from layouts exhibit high diversity in content, thereby enhancing the robustness of image grounding. Jointly training both tasks within a unified model can promote performance improvements for each. However, we identify that this joint training paradigm encounters several optimization challenges and results in restricted performance. To address these issues, we propose progressive training strategies.…
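The cycle named in the title (layout to image, then image back to layout via grounding) can be illustrated with a simple agreement check. The sketch below is not the paper's training objective, which the truncated abstract does not specify; it only assumes that a generator has produced an image from input boxes and that a grounding model has returned one predicted box per input box, and it scores their agreement with mean IoU. The names `iou` and `layout_cycle_score` are hypothetical.

```python
from typing import List, Tuple

# Axis-aligned box as (x_min, y_min, x_max, y_max); this format is an assumption.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def layout_cycle_score(input_boxes: List[Box], regrounded_boxes: List[Box]) -> float:
    """Mean IoU between the layout given to the generator and the layout
    recovered by grounding the generated image (boxes paired by index)."""
    if len(input_boxes) != len(regrounded_boxes):
        raise ValueError("expected one re-grounded box per input box")
    if not input_boxes:
        return 0.0
    pairs = zip(input_boxes, regrounded_boxes)
    return sum(iou(a, b) for a, b in pairs) / len(input_boxes)
```

A score near 1.0 would indicate that the grounded layout closely matches the layout the image was generated from; how EchoGen actually uses such a signal during training is not stated in the text above.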
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Visual Attention and Saliency Detection
