RAG-IGBench: Innovative Evaluation for RAG-based Interleaved Generation in Open-domain Question Answering

Rongyang Zhang; Yuqing Huang; Chengqiang Lu; Qimeng Wang; Yan Gao; Yi Wu; Yao Hu; Yin Xu; Wei Wang; Hao Wang; Enhong Chen

arXiv:2512.05119·cs.IR·December 8, 2025

Open Access · 1 dataset

TL;DR

This paper introduces RAG-IGBench, a comprehensive benchmark for evaluating interleaved image-text generation in open-domain question answering, addressing the lack of specialized metrics and datasets for multimodal content quality assessment.

Contribution

We present RAG-IGBench, a novel benchmark with innovative evaluation metrics and a diverse dataset, enabling better assessment of multimodal models in open-domain QA tasks.

Findings

01. State-of-the-art models show room for improvement in multimodal coherence.

02. Our metrics correlate highly with human judgments.

03. Fine-tuned models outperform the baseline on RAG-IGBench.

Abstract

In real-world scenarios, providing user queries with visually enhanced responses can considerably benefit understanding and memory, underscoring the great value of interleaved image-text generation. Despite recent progress, like the visual autoregressive model that unifies text and image processing in a single transformer architecture, generating high-quality interleaved content remains challenging. Moreover, evaluations of these interleaved sequences largely remain underexplored, with existing benchmarks often limited by unimodal metrics that inadequately assess the intricacies of combined image-text outputs. To address these issues, we present RAG-IGBench, a thorough benchmark designed specifically to evaluate the task of Interleaved Generation based on Retrieval-Augmented Generation (RAG-IG) in open-domain question answering. RAG-IG integrates multimodal large language models (MLLMs)…

Code & Models

Datasets

Muyi13/RAG-IGBench
dataset · 9 dl

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Domain Adaptation and Few-Shot Learning