CRAG-MM: Multi-modal Multi-turn Comprehensive RAG Benchmark

Jiaqi Wang; Xiao Yang; Kai Sun; Parth Suresh; Sanat Sharma; Adam Czyzewski; Derek Andersen; Surya Appini; Arkav Banerjee; Sajal Choudhary; Shervin Ghasemlou; Ziqiang Guan; Akil Iyer; Haidar Khan; Lingkun Kong; Roy Luo; Tiffany Ma; Zhen Qiao; David Tran; Wenfang Xu; Skyler Yeatman; Chen Zhou; Gunveer Gujral; Yinglong Xia; Shane Moon; Nicolas Scheffer; Nirav Shah; Eun Chang; Yue Liu; Florian Metze; Tammy Stark; Zhaleh Feizollahi; Andrea Jessee; Mangesh Pujari; Ahmed Aly; Babak Damavandi; Rakesh Wanga; Anuj Kumar; Rohit Patel; Wen-tau Yih; Xin Luna Dong

arXiv:2510.26160·cs.CV·October 31, 2025

TL;DR

CRAG-MM is a comprehensive benchmark for multi-modal, multi-turn question answering in wearable device scenarios, highlighting the challenges and providing a platform for advancing retrieval-augmented generation methods.

Contribution

It introduces a large, diverse dataset and evaluation framework for multi-modal multi-turn QA, specifically tailored for wearable device contexts, filling a significant research gap.

Findings

01

Current RAG methods achieve around 32-43% truthfulness on the benchmark.

02

State-of-the-art solutions perform similarly to baseline, indicating room for improvement.

03

The benchmark has driven significant community engagement and solution development.

Abstract

Wearable devices such as smart glasses are transforming the way people interact with their surroundings, enabling users to seek information regarding entities in their view. Multi-Modal Retrieval-Augmented Generation (MM-RAG) plays a key role in supporting such questions, yet there is still no comprehensive benchmark for this task, especially regarding wearables scenarios. To fill this gap, we present CRAG-MM -- a Comprehensive RAG benchmark for Multi-modal Multi-turn conversations. CRAG-MM contains a diverse set of 6.5K (image, question, answer) triplets and 2K visual-based multi-turn conversations across 13 domains, including 6.2K egocentric images designed to mimic captures from wearable devices. We carefully constructed the questions to reflect real-world scenarios and challenges, including five types of image-quality issues, six question types, varying entity popularity, differing…

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 2 · Confidence 3

Strengths

The paper proposes an interesting benchmark for multi-modal RAG, specifically useful for situations pertaining to wearables, which are becoming more and more commonplace. This benchmark is extremely relevant for the current and future developments of the field. Moreover, the design of the question types, tasks, and the data quality feeding the RAG systems makes the benchmark realistic, hence making the benchmark useful and reflective of the true performance of the systems evaluated with it in rea

Weaknesses

While the benchmark is well motivated and the rationale behind the dataset design and data types is clearly presented, important details are missing regarding how the dataset was actually created and how the quality of its questions was ensured. I encourage the authors to consult works such as MMQA and SMMQG, which provide comprehensive documentation of their data collection and validation processes, including the use of crowd-sourced annotators, inter-annotator agreement checks, and bias mitiga

Reviewer 02 · Rating 2 · Confidence 4

Strengths

1. The synthetic benchmark dataset covers more conversation dynamics than prior benchmarks for factual question answering, as shown in Figure 1 (see more in the weaknesses). Also, it's good that most images in this benchmark are collected from real humans using egocentric wearable headsets. 2. It's interesting that the paper targets wearable AI use cases with low-quality images. This setting is different from a lot of other relevant benchmarks.

Weaknesses

1. Benchmark Dataset Design - The reviewer is confused about whether this is a RAG benchmark, a search-augmented benchmark, or a long-context benchmark. Based on the description in Section 2 and Section 4.1, it seems that the authors provide an API function (tool) for these VLMs to use and always assume the model would use it. If that's the case, it's more like a long-context QA benchmark because the search part is fixed. If not, it can be a search-augmented benchmark where the models might be able to decide whether

Reviewer 03 · Rating 6 · Confidence 2

Strengths

1) Rich conversational coverage: Includes 2K multi-turn conversations, ~38% of which involve domain shifts, realistically simulating natural topic drift. 2) Real-world visual realism: Contains 7.9K images, with 79% egocentric, capturing wearable AI’s inherent visual challenges (wide-angle, occlusion, low light). 3) Comprehensive evaluation: GPT-5 achieves 63% (single-turn) and 70% (multi-turn) accuracy, revealing a measurable gap and potential for improvement on MM-RAG. 4) Community imp

Weaknesses

Ethical and safety concerns: The dataset involves vendors wearing smart glasses in daily contexts. While this enables realism, it raises potential privacy and identifiability risks for bystanders. Clearer documentation of anonymization and consent protocols is needed.

Code & Models

Datasets

kaimeta/wearables_benchmarks
dataset· 26 dl
