Ask in Any Modality: A Comprehensive Survey on Multimodal Retrieval-Augmented Generation
Mohammad Mahdi Abootorabi, Amirhosein Zobeiri, Mahdi Dehghani, Mohammadali Mohammadkhani, Bardia Mohammadi, Omid Ghahroodi, Mahdieh Soleymani Baghshah, Ehsaneddin Asgari

TL;DR
This survey comprehensively reviews Multimodal Retrieval-Augmented Generation, highlighting datasets, methodologies, challenges, and future directions for integrating multiple modalities to improve AI's factual accuracy and reasoning capabilities.
Contribution
It provides a structured analysis of Multimodal RAG systems, covering recent advancements, challenges, and open problems, serving as a foundational resource for future research.
Findings
Analyzes datasets, benchmarks, and evaluation metrics for Multimodal RAG.
Summarizes recent methodologies and innovations in retrieval and fusion techniques.
Identifies open challenges and future research directions in multimodal reasoning.
Abstract
Large Language Models (LLMs) suffer from hallucinations and outdated knowledge due to their reliance on static training data. Retrieval-Augmented Generation (RAG) mitigates these issues by integrating external dynamic information for improved factual grounding. With advances in multimodal learning, Multimodal RAG extends this approach by incorporating multiple modalities such as text, images, audio, and video to enhance the generated outputs. However, cross-modal alignment and reasoning introduce unique challenges beyond those in unimodal RAG. This survey offers a structured and comprehensive analysis of Multimodal RAG systems, covering datasets, benchmarks, metrics, evaluation, methodologies, and innovations in retrieval, fusion, augmentation, and generation. We review training strategies, robustness enhancements, loss functions, and agent-based approaches, while also exploring the…
Taxonomy
Topics: Topic Modeling · Speech and dialogue systems · Multimodal Machine Learning Applications
Methods: Attention Is All You Need · Linear Warmup With Linear Decay · Weight Decay · Attention Dropout · Byte Pair Encoding · Layer Normalization · Residual Connection · WordPiece · Linear Layer
