Interleaved Scene Graphs for Interleaved Text-and-Image Generation Assessment

Dongping Chen; Ruoxi Chen; Shu Pu; Zhaoyi Liu; Yanru Wu; Caixi Chen; Benlin Liu; Yue Huang; Yao Wan; Pan Zhou; Ranjay Krishna

arXiv:2411.17188 · cs.CV · March 25, 2025

TL;DR

This paper introduces ISG, a comprehensive evaluation framework and benchmark for interleaved text-and-image generation, revealing current models' limitations and proposing a baseline approach to improve performance.

Contribution

The paper presents ISG, a multi-level evaluation framework, a new benchmark dataset ISG-Bench, and a baseline agent for interleaved text-and-image generation tasks.

Findings

01

Unified models perform poorly on interleaved generation.

02

Compositional approaches, which combine separate language and image models, improve on unified models by 111% at the holistic level.

03

Baseline agent improves results by 122%.

Abstract

Many real-world user queries (e.g., "How do I make egg fried rice?") could benefit from systems capable of generating responses that pair textual steps with accompanying images, similar to a cookbook. Models designed to generate interleaved text and images face challenges in ensuring consistency within and across these modalities. To address these challenges, we present ISG, a comprehensive evaluation framework for interleaved text-and-image generation. ISG leverages a scene graph structure to capture relationships between text and image blocks, evaluating responses on four levels of granularity: holistic, structural, block-level, and image-specific. This multi-tiered evaluation allows for a nuanced assessment of consistency, coherence, and accuracy, and provides interpretable question-answer feedback. In conjunction with ISG, we introduce a benchmark, ISG-Bench, encompassing 1,150…
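To make the scene-graph idea in the abstract more concrete, here is a minimal Python sketch of an interleaved response represented as text and image blocks connected by labeled relations, plus a simple structural check. It is an illustration only, not the authors' implementation: the class names (`Block`, `InterleavedGraph`), the relation label `"illustrates"`, the placeholder image identifiers, and the `structure_matches` helper are all assumptions introduced for this example.

```python
from dataclasses import dataclass, field
from typing import List, Sequence, Tuple

# The four evaluation levels named in the abstract.
LEVELS = ("holistic", "structural", "block-level", "image-specific")


@dataclass
class Block:
    kind: str     # "text" or "image"
    content: str  # the text itself, or a placeholder identifier for an image


@dataclass
class InterleavedGraph:
    """Hypothetical container for an interleaved text-and-image response."""
    blocks: List[Block] = field(default_factory=list)
    # Each relation is (source block index, label, target block index).
    relations: List[Tuple[int, str, int]] = field(default_factory=list)

    def add_block(self, kind: str, content: str) -> int:
        if kind not in ("text", "image"):
            raise ValueError(f"unknown block kind: {kind!r}")
        self.blocks.append(Block(kind, content))
        return len(self.blocks) - 1

    def relate(self, src: int, label: str, dst: int) -> None:
        for idx in (src, dst):
            if not 0 <= idx < len(self.blocks):
                raise IndexError(f"no block at index {idx}")
        self.relations.append((src, label, dst))

    def structure(self) -> List[str]:
        """Return the ordered sequence of block kinds."""
        return [block.kind for block in self.blocks]


def structure_matches(graph: InterleavedGraph, expected: Sequence[str]) -> bool:
    """Assumed structural-level check: does the block order match the request?"""
    return graph.structure() == list(expected)


# A two-step cookbook-style response to the egg fried rice query.
graph = InterleavedGraph()
step1 = graph.add_block("text", "Step 1: Beat the eggs.")
img1 = graph.add_block("image", "<image-1>")
step2 = graph.add_block("text", "Step 2: Stir-fry the rice with the eggs.")
img2 = graph.add_block("image", "<image-2>")
graph.relate(img1, "illustrates", step1)
graph.relate(img2, "illustrates", step2)

print(structure_matches(graph, ["text", "image", "text", "image"]))
```

In this sketch only the structural level is checked. The holistic, block-level, and image-specific levels would need a judge model or question-answering component, which is outside the scope of the example.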

Code & Models

Datasets

shuaishuaicdp/ISG-Bench
dataset · 2.0k downloads

Taxonomy

Topics: Video Analysis and Summarization · Multimodal Machine Learning Applications · Human Motion and Animation