CIGLI: Conditional Image Generation from Language & Image
Xiaopeng Lu, Lynnette Ng, Jared Fernandez, Hao Zhu

TL;DR
This paper introduces CIGLI, a new task for generating an image from a language description combined with an image prompt, along with a dataset for the task and a fusion model that outperforms two established baselines.
Contribution
The paper presents a novel task, a dedicated dataset, and a fusion model for generating images from both text and image inputs, advancing multi-modal generation research.
Findings
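The fusion model outperforms two established baseline methods in both automatic (quantitative) and human (qualitative) evaluations.
A new dataset is designed so that the text description alone is insufficient to generate the target image; the image prompt is also required.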
The approach improves multi-modal image generation quality.
Abstract
Multi-modal generation has been widely explored in recent years. Current research directions involve generating text based on an image or vice versa. In this paper, we propose a new task called CIGLI: Conditional Image Generation from Language and Image. Instead of generating an image based on text alone, as in text-to-image generation, this task requires the generation of an image from a textual description and an image prompt. We designed a new dataset to ensure that the text description describes information from both images, and that solely analyzing the description is insufficient to generate an image. We then propose a novel language-image fusion model which improves performance over two established baseline methods, as evaluated by quantitative (automatic) and qualitative (human) evaluations. The code and dataset are available at https://github.com/vincentlux/CIGLI.
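To make the task's input/output contract concrete, the following is a minimal sketch, not the authors' model. It assumes each modality has already been encoded into a flat feature vector, and it uses plain concatenation as a stand-in fusion step; the abstract does not describe how the paper's fusion model actually combines the two inputs. The names `CigliExample`, `fuse_by_concatenation`, and `conditioning_vector` are hypothetical and do not come from the paper or its repository.

```python
from dataclasses import dataclass
from typing import List

Vector = List[float]


@dataclass
class CigliExample:
    """Hypothetical task instance: (image prompt, description) -> target image."""
    prompt_image: Vector   # stand-in for encoded image-prompt features
    description: Vector    # stand-in for encoded text features
    target_image: Vector   # stand-in for the image the model should generate


def fuse_by_concatenation(text_features: Vector, image_features: Vector) -> Vector:
    """Join both modalities into one conditioning vector (illustrative assumption)."""
    return list(text_features) + list(image_features)


def conditioning_vector(example: CigliExample) -> Vector:
    """Build the generator's conditioning input from the text and the image prompt."""
    return fuse_by_concatenation(example.description, example.prompt_image)
```

The point of the sketch is only that a generator for this task is conditioned on both inputs at once: dropping either the description or the image prompt removes information the dataset is designed to require.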
Taxonomy
Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Topic Modeling
