RAG-IGBench: Innovative Evaluation for RAG-based Interleaved Generation in Open-domain Question Answering
Rongyang Zhang, Yuqing Huang, Chengqiang Lu, Qimeng Wang, Yan Gao, Yi Wu, Yao Hu, Yin Xu, Wei Wang, Hao Wang, Enhong Chen

TL;DR
This paper introduces RAG-IGBench, a comprehensive benchmark for evaluating interleaved image-text generation in open-domain question answering, addressing the lack of specialized metrics and datasets for multimodal content quality assessment.
Contribution
We present RAG-IGBench, a novel benchmark with innovative evaluation metrics and a diverse dataset, enabling better assessment of multimodal models in open-domain QA tasks.
Findings
State-of-the-art models show room for improvement in multimodal coherence.
Our metrics correlate highly with human judgments.
Fine-tuned models outperform the baseline on RAG-IGBench.
Abstract
In real-world scenarios, providing user queries with visually enhanced responses can considerably benefit understanding and memory, underscoring the great value of interleaved image-text generation. Despite recent progress, like the visual autoregressive model that unifies text and image processing in a single transformer architecture, generating high-quality interleaved content remains challenging. Moreover, evaluations of these interleaved sequences largely remain underexplored, with existing benchmarks often limited by unimodal metrics that inadequately assess the intricacies of combined image-text outputs. To address these issues, we present RAG-IGBench, a thorough benchmark designed specifically to evaluate the task of Interleaved Generation based on Retrieval-Augmented Generation (RAG-IG) in open-domain question answering. RAG-IG integrates multimodal large language models (MLLMs)…
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Domain Adaptation and Few-Shot Learning
