OpenLEAF: Open-Domain Interleaved Image-Text Generation and Evaluation

Jie An; Zhengyuan Yang; Linjie Li; Jianfeng Wang; Kevin Lin; Zicheng Liu; Lijuan Wang; Jiebo Luo

arXiv:2310.07749 · cs.CV · November 7, 2023 · 1 cites

TL;DR

OpenLEAF is a framework for open-domain interleaved image-text generation. It prompts a large language model and pre-trained text-to-image models to produce image-text sequences whose entities and style stay consistent, and it uses large multi-modal models to evaluate that consistency.

Contribution

The paper presents a new interleaved generation framework and a large multimodal model-based evaluation method for open-domain image-text sequences.

Findings

01 High-quality interleaved image-text generation across various domains

02 Effective evaluation of entity and style consistency using LMMs

03 Validation of LMM evaluation with human assessments

Abstract

This work investigates a challenging task named open-domain interleaved image-text generation, which generates interleaved texts and images following an input query. We propose a new interleaved generation framework based on prompting large-language models (LLMs) and pre-trained text-to-image (T2I) models, namely OpenLEAF. In OpenLEAF, the LLM generates textual descriptions, coordinates T2I models, creates visual prompts for generating images, and incorporates global contexts into the T2I models. This global context improves the entity and style consistencies of images in the interleaved generation. For model assessment, we first propose to use large multi-modal models (LMMs) to evaluate the entity and style consistencies of open-domain interleaved image-text sequences. According to the LMM evaluation on our constructed evaluation set, the proposed interleaved generation framework can…
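The control flow described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function `generate_interleaved`, the `Segment` type, the prompt wording, and the stub callables `stub_llm` and `stub_t2i` are all hypothetical, and the way the global context is appended to each visual prompt is an assumption made only to show the idea of sharing one context across all images.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Segment:
    text: str           # generated paragraph
    visual_prompt: str  # prompt sent to the T2I model, global context included
    image: str          # placeholder for whatever the T2I callable returns


def generate_interleaved(
    query: str,
    llm: Callable[[str], str],
    t2i: Callable[[str], str],
    num_segments: int = 3,
) -> List[Segment]:
    """Hypothetical pipeline: the LLM writes text, derives a visual prompt,
    and a shared global context is attached to every visual prompt."""
    # Assumption: one global description of entities and style, generated once.
    global_context = llm(f"List the recurring entities and visual style for: {query}")
    segments = []
    for i in range(num_segments):
        text = llm(f"Write part {i + 1} of {num_segments} for: {query}")
        local_prompt = llm(f"Write an image prompt for: {text}")
        # Assumption: global context is simply appended to the local prompt.
        visual_prompt = f"{local_prompt} | {global_context}"
        segments.append(Segment(text, visual_prompt, t2i(visual_prompt)))
    return segments


# Stub models so the sketch runs without any external API.
def stub_llm(prompt: str) -> str:
    return f"LLM({prompt})"


def stub_t2i(prompt: str) -> str:
    return f"IMG({prompt})"


result = generate_interleaved("a fox's journey", stub_llm, stub_t2i, num_segments=2)
for seg in result:
    print(seg.text)
    print(seg.image)
```

In a real system the two callables would wrap an LLM and a T2I model; the stubs only make the data flow visible.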

Peer Reviews

Decision · ICLR 2024 Conference Withdrawn Submission

Reviewer 01 · Rating 5 · marginally below the acceptance threshold · Confidence 4

Strengths

1. The writing is clear and easy to follow.
2. The authors investigate the open-domain interleaved image-text generation task and explore the ability of large multi-modal models (LMMs) to assess this task.
3. The authors propose a framework for adopting existing models, such as ChatGPT and SDXL, to address the interleaved image-text generation task.

Weaknesses

1. Basically, the paper introduces a framework for the interleaved image-text generation task, primarily by combining existing models, such as ChatGPT and SDXL. Given this, the method may not exhibit strong technical novelty.
2. The proposed method may require complex prompt generation to achieve the desired results, which might not be user-friendly for the average user.

Reviewer 02 · Rating 3 · reject, not good enough · Confidence 4

Strengths

The problem is interesting and important. The solution is simple.

Weaknesses

Experimental results are not comprehensive. Size of data is very small. No ablation studies. No comparison with baselines. This is really ad hoc prompt engineering for in-context learning (a corollary). The paper really highlights the power of GPT-4.

Reviewer 03 · Rating 5 · marginally below the acceptance threshold · Confidence 4

Strengths

(1) The paper presents a unified solution, offering a fresh perspective in the sphere of open-domain interleaved image-text generation.
(2) OpenLEAF emerges as a cohesive framework, harmonizing the strengths of LLMs and T2I models to birth sequences radiant with quality and coherence.
(3) The authors enrich the evaluation by incorporating a benchmark dataset and a fortified evaluation methodology, enhancing the objectivity and comprehensiveness of sequence evaluations.

Weaknesses

(1) The paper's architectural foundation seems somewhat pre-ordained, leveraging predefined templates to navigate the realms of interleaved image-text generation. This approach echoes the contours of prompt engineering, utilizing well-established technical components like GPT-4 and SD-XL, making the technical contributions seem somewhat restrained and not profoundly innovative.
(2) A shadow of vulnerability seems to cloak the proposed mechanisms aimed at ensuring entity and style consistency. T

Reviewer 04 · Rating 3 · reject, not good enough · Confidence 5

Strengths

This paper presents an interleaved generation framework based on prompting large-language models (LLMs) and pre-trained text-to-image (T2I) models, which includes user query composition, text generation, and the addition of global context. The paper also presents an LLM-based evaluation strategy for assessing interleaved content generation in two aspects: Entity Consistency Evaluation and Style Consistency Evaluation.

Weaknesses

1. Novelty. The major concern with this paper is its excessive reliance on the APIs of existing models, GPT-4 and SDXL. The method appears more like a prompt engineering approach, devoid of the need for fine-tuning models, analyzing network structures, or delving into training strategies. It lacks the provision of novel insights, and from my perspective, this paper resembles more of a technical report than an academic contribution.
2. Entity Consistency and Style Consistency. Indeed, Entity C

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Computational and Text Analysis Methods