Benchmarking Multimodal Retrieval Augmented Generation with Dynamic VQA Dataset and Self-adaptive Planning Agent

Yangning Li; Yinghui Li; Xinyu Wang; Yong Jiang; Zhen Zhang; Xinran Zheng; Hui Wang; Hai-Tao Zheng; Philip S. Yu; Fei Huang; Jingren Zhou

arXiv:2411.02937·cs.CL·May 27, 2025·2 cites

TL;DR

This paper introduces a new dynamic VQA dataset and a self-adaptive planning agent called OmniSearch for multimodal retrieval augmented generation, addressing limitations of fixed retrieval strategies in existing models.

Contribution

It constructs the Dyn-VQA dataset with complex, dynamic questions and proposes OmniSearch, the first self-adaptive planning agent for flexible multimodal retrieval.

Findings

01. Existing heuristic mRAGs struggle with dynamic questions.

02. OmniSearch significantly improves retrieval relevance and adaptability.

03. The dataset and method advance multimodal question answering capabilities.

Abstract

Multimodal Retrieval Augmented Generation (mRAG) plays an important role in mitigating the "hallucination" issue inherent in multimodal large language models (MLLMs). Although promising, existing heuristic mRAGs typically predefine fixed retrieval processes, which causes two issues: (1) Non-adaptive Retrieval Queries. (2) Overloaded Retrieval Queries. However, these flaws cannot be adequately reflected by current knowledge-seeking visual question answering (VQA) datasets, since most of the required knowledge can be readily obtained with a standard two-step retrieval. To bridge the dataset gap, we first construct the Dyn-VQA dataset, consisting of three types of "dynamic" questions, which require complex knowledge retrieval strategies variable in query, tool, and time: (1) Questions with rapidly changing answers. (2) Questions requiring multi-modal knowledge. (3) Multi-hop questions.…
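To make the contrast between a fixed two-step retrieval and step-by-step planning concrete, here is a minimal toy sketch. It is not the OmniSearch implementation: the indexes, the rule-based `toy_planner`, and every function name are hypothetical stand-ins chosen for illustration, and a real system would presumably use an MLLM as the planner and live search tools.

```python
from typing import Dict, Optional, Tuple

# Hypothetical toy indexes standing in for real image and web search tools.
IMAGE_INDEX: Dict[str, str] = {"photo_001": "Band X"}
TEXT_INDEX: Dict[str, str] = {
    "lead singer of Band X": "Singer Y",
    "latest album of Singer Y": "Album Z",
}


def image_search(image_id: str) -> Optional[str]:
    """Toy image retrieval: map an image id to the entity it shows."""
    return IMAGE_INDEX.get(image_id)


def text_search(query: str) -> Optional[str]:
    """Toy text retrieval: exact-match lookup of a query string."""
    return TEXT_INDEX.get(query)


def fixed_two_step(image_id: str) -> Optional[str]:
    """Fixed pipeline: one image lookup, then one text lookup with the whole question."""
    entity = image_search(image_id)
    if entity is None:
        return None
    # The full multi-hop question is sent as a single query.
    return text_search(f"latest album of the lead singer of {entity}")


def toy_planner(state: Dict[str, str], image_id: str) -> Tuple[str, str, str]:
    """Rule-based stand-in for a planning model: returns (tool, query, slot to fill)."""
    if "band" not in state:
        return ("image", image_id, "band")
    if "singer" not in state:
        return ("text", f"lead singer of {state['band']}", "singer")
    if "album" not in state:
        return ("text", f"latest album of {state['singer']}", "album")
    return ("answer", state["album"], "")


def adaptive_search(image_id: str, max_steps: int = 5) -> Optional[str]:
    """Plan one retrieval action at a time, conditioning on what was found so far."""
    state: Dict[str, str] = {}
    for _ in range(max_steps):
        tool, query, slot = toy_planner(state, image_id)
        if tool == "answer":
            return query
        result = image_search(query) if tool == "image" else text_search(query)
        if result is None:
            return None  # retrieval failed; this toy planner has no fallback
        state[slot] = result
    return None  # step budget exhausted before the planner could answer


if __name__ == "__main__":
    print("fixed two-step:", fixed_two_step("photo_001"))
    print("adaptive:", adaptive_search("photo_001"))
```

In this toy setting the fixed pipeline fails because no single indexed entry matches the whole multi-hop question, while the planner loop decomposes it into sub-queries that each depend on the previous result. The `max_steps` budget is an assumed safeguard against unbounded retrieval, not something taken from the paper.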

Peer Reviews

Decision·ICLR 2025 Poster

Reviewer 01 · Rating 6 · Confidence 3

Strengths

- Originality: The authors have an innovative and practical focus on the gap in existing benchmarks, especially dynamic retrieval and multi-modal multi-hop questions. This dataset is extensively curated manually and reflects some real-world complexities.
- Clarity: The paper provides clear definitions, comprehensive descriptions of the dataset construction, and detailed experimental setups. It includes some comparison with baselines and highlights the uniqueness and impact of OmniSearch.
- Q

Weaknesses

- The dataset's limited size of 1.5k samples, with only 178 questions covering all three challenging categories, raises questions about whether its complexity and diversity are sufficient to benefit the broader research community.
- In the data curation part, the dataset only includes English and Chinese, and the authors filtered intractable instances that don't translate well; this might limit the dataset's diversity or introduce bias. if more languages are included, and examples are elaborated

Reviewer 02 · Rating 8 · Confidence 4

Strengths

1. They introduce a strong, novel dataset, Dyn-VQA, which offers a new, and uniquely hard multi-modal retrieval challenge for adaptable, multi-hop retrievals - which mimics real world settings well.
2. Strong experimental section - they benchmark this dataset with several MLLMs, and several types of mRAG methods.
3. They introduce the OmniSearch method which performs very well on this task.
4. The paper is mostly clearly written and well motivated.

Weaknesses

1. The dataset has a strong motivation, however, the abstract and introduction could more clearly address the concepts of (as you labeled them) (1) Non-adaptive Retrieval Queries and (2) Overloaded Retrieval Queries. Clarifying these issues up front would help position and motivate the work better - and I felt they weren't so clearly explained.
2. There could be more discussion around the scalability and computational cost of OmniSearch to provide a better sense of its applicability in real-wor

Reviewer 03 · Rating 6 · Confidence 4

Strengths

- **Introduction of a new VQA benchmark**: The authors focus on "dynamic" visual questions, where answers can change over time. This type of question is frequently encountered in real-world scenarios but is underrepresented in existing VQA datasets. The authors propose a new benchmark featuring dynamic visual questions that reflect the complexity of real-world inquiries.
- **Proposal of a new mRAG approach**: The authors introduce a self-adaptive retrieval agent that plans each retrieval action

Weaknesses

- **Sustainability of Benchmark Accuracy**: Since the answers to certain questions in this benchmark may change over time, there is a risk that the answers will become **outdated** after the benchmark is publicly released. This raises concerns about how to ensure accurate model evaluation in such cases (i.e., when the benchmark's ground-truth answers no longer reflect current information). How will the benchmark address this issue to continue providing reliable, up-to-date evaluations?
- **Uncl

Code & Models

Repositories

alibaba-nlp/omnisearch
PyTorch · Official

Datasets

zhzhen23/DynVQA
dataset· 81 dl

Taxonomy

Topics: Neural Networks and Applications · Diverse Interdisciplinary Research Innovations · Advanced Computational Techniques and Applications