Interleaved Scene Graphs for Interleaved Text-and-Image Generation Assessment
Dongping Chen, Ruoxi Chen, Shu Pu, Zhaoyi Liu, Yanru Wu, Caixi Chen, Benlin Liu, Yue Huang, Yao Wan, Pan Zhou, Ranjay Krishna

TL;DR
This paper introduces ISG, a comprehensive evaluation framework and benchmark for interleaved text-and-image generation, revealing current models' limitations and proposing a baseline approach to improve performance.
Contribution
The paper presents three contributions for interleaved text-and-image generation: ISG, a multi-level evaluation framework; ISG-Bench, a new benchmark dataset; and a baseline agent.
Findings
Unified models perform poorly on interleaved generation.
Compositional approaches improve performance by 111% over unified models.
Baseline agent improves results by 122%.
Abstract
Many real-world user queries (e.g., "How to make egg fried rice?") could benefit from systems capable of generating responses that pair textual steps with accompanying images, similar to a cookbook. Models designed to generate interleaved text and images face challenges in ensuring consistency within and across these modalities. To address these challenges, we present ISG, a comprehensive evaluation framework for interleaved text-and-image generation. ISG leverages a scene graph structure to capture relationships between text and image blocks, evaluating responses on four levels of granularity: holistic, structural, block-level, and image-specific. This multi-tiered evaluation allows for a nuanced assessment of consistency, coherence, and accuracy, and provides interpretable question-answer feedback. In conjunction with ISG, we introduce a benchmark, ISG-Bench, encompassing 1,150…
Taxonomy
Topics: Video Analysis and Summarization · Multimodal Machine Learning Applications · Human Motion and Animation
