RAVENEA: A Benchmark for Multimodal Retrieval-Augmented Visual Culture Understanding
Jiaang Li, Yifei Yuan, Wenyan Li, Mohammad Aliannejadi, Daniel Hershcovich, Anders Søgaard, Ivan Vulić, Wenxuan Zhang, Paul Pu Liang, Yang Deng, Serge Belongie

TL;DR
RAVENEA introduces a benchmark for evaluating multimodal retrieval-augmented models in visual culture understanding, highlighting the benefits and challenges of integrating cultural knowledge into vision-language models.
Contribution
The paper presents RAVENEA, a new benchmark dataset and evaluation framework for retrieval-augmented visual culture understanding in multimodal models.
Findings
Cultural grounding annotations improve retrieval and downstream tasks.
VLMs with culture-aware retrieval outperform non-augmented models (+6% on cVQA, +11% on cIC).
Performance varies significantly across different countries.
Abstract
As vision-language models (VLMs) become increasingly integrated into daily life, the need for accurate visual culture understanding is becoming critical. Yet, these models frequently fall short in interpreting cultural nuances effectively. Prior work has demonstrated the effectiveness of retrieval-augmented generation (RAG) in enhancing cultural understanding in text-only settings, while its application in multimodal scenarios remains underexplored. To bridge this gap, we introduce RAVENEA (Retrieval-Augmented Visual culturE uNdErstAnding), a new benchmark designed to advance visual culture understanding through retrieval, focusing on two tasks: culture-focused visual question answering (cVQA) and culture-informed image captioning (cIC). RAVENEA extends existing datasets by integrating over 11,396 unique Wikipedia documents curated and ranked by human annotators. Through the extensive…
Peer Reviews
Decision: ICLR 2026 Poster
- Well-founded benchmark design with human relevance supervision. The pipeline uses GPT-4o captions → BM25 over ~6M English Wikipedia pages → human re-ranking with a clear, 3-question taxonomy (country association, topic alignment, explicit visual representation). The creation methodology is also well outlined, easy to understand, and well documented.
- Transparent quality control and strong agreement. There are multiple pieces of evidence that supervision is reliable and that the QA process was
- Generalizability of the method. The paper emphasizes CAC and CaCLIP, which are trained and evaluated on RAVENEA. Without testing on external cultural datasets, it is difficult to claim general culture-aware retrieval superiority beyond “RAVENEA-specific specialization.” Evaluation on other datasets capturing similar concepts (e.g., CVQA, CCUB, CultureVLM, WorldCuisines, ALMBench) would strengthen the claim.
- The proposed RegionScore still measures a lexical proxy rather than cultural semantic
- **Timely Contribution**: The paper addresses the relatively underexplored intersection of multimodal retrieval and cultural understanding. While RAG has been effective in text-based cultural reasoning, its application to vision-language understanding remains sparse. RAVENEA directly addresses this omission and establishes a systematic evaluation framework.
- **Comprehensive Experimental Design**: The evaluation encompasses seven retrievers and fourteen VLMs across multiple model families, pr
- **Limited Novelty Beyond Dataset Construction**: Despite the benchmark’s quality, the methodological innovation (CAC loss and RegionScore) remains incremental. The CAC objective essentially adapts contrastive alignment to a culturally labeled setup, a relatively modest technical contribution. RegionScore, while intuitive, captures surface-level lexical cues (country/demonym mentions) rather than deeper cultural semantics.
- **Inadequate Relevance Annotation Scheme**: The paper employs a
- Problem Definition and Task Design: The paper clearly defines the problem and presents tasks that align well with practical applications. It focuses on the scenario of “visual content + external cultural knowledge”, using two downstream tasks, cVQA (multiple-choice visual question answering) and cIC (image captioning), to evaluate the full pipeline of retrieving → consuming knowledge → answering/generating.
- Data Scale and Annotation Quality Control: The dataset covers 8 countries and 11 categ
- Limited Domain Coverage and Source Bias: As acknowledged by the authors in the “Limitations” section, the dataset covers only 8 countries and 11 categories, and relies primarily on English Wikipedia articles. This setup is prone to cultural and regional exposure bias, which may affect the generalizability of the claimed cross-cultural conclusions. Future work should incorporate GLAM (Galleries, Libraries, Archives, Museums) institutional collections and non-English knowledge sources to achiev
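
The reviews above describe the benchmark's candidate-retrieval step as GPT-4o captions used as queries for BM25 over English Wikipedia, followed by human re-ranking. As a rough illustration of that first step only, here is a minimal Okapi BM25 sketch in pure Python. Everything in it is an assumption made for illustration: the function names `tokenize` and `bm25_rank`, the parameters `k1=1.5` and `b=0.75`, the toy caption, and the three toy documents are hypothetical and are not taken from the paper or its code.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split on non-alphanumeric characters (a simplifying assumption)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_rank(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with Okapi BM25.

    Returns (ranking, scores): document indices sorted by descending
    score, and the raw score of each document in input order.
    """
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for toks in tokenized for term in set(toks))
    # Simplification: each distinct query term is counted once.
    query_terms = set(tokenize(query))

    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            n_t = doc_freq[term]
            idf = math.log((n_docs - n_t + 0.5) / (n_t + 0.5) + 1.0)
            length_norm = 1 - b + b * len(toks) / avg_len
            score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * length_norm)
        scores.append(score)

    ranking = sorted(range(n_docs), key=lambda i: scores[i], reverse=True)
    return ranking, scores


# Hypothetical caption standing in for a model-generated image description.
caption = "Woman wearing kimono during a tea ceremony"
# Hypothetical stand-ins for Wikipedia passages.
docs = [
    "The Eiffel Tower is a wrought-iron lattice tower in Paris, France.",
    "The Japanese tea ceremony is a cultural activity involving the "
    "ceremonial preparation of matcha.",
    "Tacos are a traditional Mexican dish made with a tortilla.",
]
ranking, scores = bm25_rank(caption, docs)
print(ranking[0])  # index of the top-ranked toy document
```

In this toy setup the tea-ceremony passage ranks first because it shares the rare terms "tea" and "ceremony" with the caption. Purely lexical matching of this kind is also why the pipeline described in the reviews adds a human re-ranking stage on top of BM25.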
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Video Analysis and Summarization
