M4-RAG: A Massive-Scale Multilingual Multi-Cultural Multimodal RAG

David Anugraha; Patrick Amadeus Irawan; Anshul Singh; En-Shiun Annie Lee; Genta Indra Winata

arXiv:2512.05959 · cs.CL · March 24, 2026

TL;DR

M4-RAG introduces a comprehensive multilingual, multi-cultural, and multimodal benchmark for retrieval-augmented visual question answering, revealing challenges in scaling and cross-lingual performance for large vision-language models.

Contribution

M4-RAG provides the first large-scale multilingual multimodal RAG benchmark covering diverse languages and dialects, and systematically evaluates retrieval effectiveness across model sizes and languages.

Findings

01. RAG benefits smaller models but not larger ones.

02. Performance degrades with non-English prompts and context.

03. Large models often experience performance degradation with retrieval.

Abstract

Vision-language models (VLMs) have achieved strong performance in visual question answering (VQA), yet they remain constrained by static training data. Retrieval-Augmented Generation (RAG) mitigates this limitation by enabling access to up-to-date, culturally grounded, and multilingual information; however, multilingual multimodal RAG remains largely underexplored. We introduce M4-RAG, a massive-scale benchmark spanning 42 languages, 56 regional dialects and registers, and 189 countries, comprising over 80,000 culturally diverse image-question pairs for evaluating retrieval-augmented VQA across languages and modalities. To balance realism with reproducibility, we build a controlled retrieval environment containing millions of carefully curated multilingual documents relevant to the query domains, approximating real-world retrieval conditions while ensuring consistent experimentation.…
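The retrieve-then-answer flow that the abstract describes can be illustrated with a minimal sketch. This is not the authors' pipeline: the toy corpus, the bag-of-words retriever, the use of a text caption as a stand-in for the image, and the helper names `retrieve` and `build_prompt` are all assumptions made for illustration only.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercased word tokens; a real multilingual retriever would use
    # a proper tokenizer or a learned embedding model instead.
    return re.findall(r"\w+", text.lower())


def cosine(a, b):
    # Cosine similarity between two bag-of-words Counters.
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, corpus, k=2):
    # Hypothetical retriever: rank documents by similarity to the query
    # and keep the top k that share at least one token with it.
    q = Counter(tokenize(query))
    scored = [(cosine(q, Counter(tokenize(doc))), doc) for doc in corpus]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for score, doc in scored[:k] if score > 0]


def build_prompt(question, image_caption, docs):
    # Hypothetical prompt layout: retrieved context first, then the
    # (captioned) image and the question for the model to answer.
    context = "\n".join(f"- {d}" for d in docs) or "- (no documents retrieved)"
    return (
        f"Context:\n{context}\n\n"
        f"Image: {image_caption}\n"
        f"Question: {question}\nAnswer:"
    )


# Toy corpus invented for this sketch; it is not drawn from M4-RAG.
CORPUS = [
    "Rendang is a slow-cooked beef dish from West Sumatra, Indonesia.",
    "Kimchi is a Korean side dish of fermented vegetables.",
    "The Eiffel Tower is a landmark in Paris, France.",
]

question = "Which region of Indonesia does this beef dish come from?"
caption = "a plate of rendang"
docs = retrieve(question + " " + caption, CORPUS, k=1)
prompt = build_prompt(question, caption, docs)
print(prompt)
```

In a retrieval-augmented VQA evaluation, the same question would typically be answered once with the retrieved context and once without it, so that the effect of retrieval can be measured per model and per language.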

Code & Models

Datasets

davidanugraha/M4-RAG
dataset · 18 dl

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning