Ask in Any Modality: A Comprehensive Survey on Multimodal Retrieval-Augmented Generation

Mohammad Mahdi Abootorabi; Amirhosein Zobeiri; Mahdi Dehghani; Mohammadali Mohammadkhani; Bardia Mohammadi; Omid Ghahroodi; Mahdieh Soleymani Baghshah; Ehsaneddin Asgari

arXiv:2502.08826 · cs.CL · June 3, 2025

TL;DR

This survey comprehensively reviews Multimodal Retrieval-Augmented Generation, highlighting datasets, methodologies, challenges, and future directions for integrating multiple modalities to improve AI's factual accuracy and reasoning capabilities.

Contribution

It provides a structured analysis of Multimodal RAG systems, covering recent advancements, challenges, and open problems, serving as a foundational resource for future research.

Findings

01. Analyzes datasets, benchmarks, and evaluation metrics for Multimodal RAG.

02. Summarizes recent methodologies and innovations in retrieval and fusion techniques.

03. Identifies open challenges and future research directions in multimodal reasoning.

Abstract

Large Language Models (LLMs) suffer from hallucinations and outdated knowledge due to their reliance on static training data. Retrieval-Augmented Generation (RAG) mitigates these issues by integrating external dynamic information for improved factual grounding. With advances in multimodal learning, Multimodal RAG extends this approach by incorporating multiple modalities such as text, images, audio, and video to enhance the generated outputs. However, cross-modal alignment and reasoning introduce unique challenges beyond those in unimodal RAG. This survey offers a structured and comprehensive analysis of Multimodal RAG systems, covering datasets, benchmarks, metrics, evaluation, methodologies, and innovations in retrieval, fusion, augmentation, and generation. We review training strategies, robustness enhancements, loss functions, and agent-based approaches, while also exploring the…

Code & Models

Repositories

llm-lab-org/multimodal-rag-survey (Official)

Videos

Ask in Any Modality: A Comprehensive Survey on Multimodal Retrieval-Augmented Generation

Taxonomy

Topics: Topic Modeling · Speech and dialogue systems · Multimodal Machine Learning Applications

Methods: Attention Is All You Need · Linear Warmup With Linear Decay · Weight Decay · Attention Dropout · Byte Pair Encoding · Layer Normalization · Residual Connection · WordPiece · Linear Layer