MM-BRIGHT: A Multi-Task Multimodal Benchmark for Reasoning-Intensive Retrieval

Abdelrahman Abdallah; Mohamed Darwish Mounis; Mahmoud Abdalla; Mahmoud SalahEldin Kasem; Mostafa Farouk Senussi; Mohamed Mahmoud; Mohammed Ali; Adam Jatowt; Hyun-Soo Kang

arXiv:2601.09562 · cs.IR · January 16, 2026



Open Access

TL;DR

MM-BRIGHT is a new multimodal benchmark designed to evaluate reasoning-intensive retrieval across diverse technical domains, revealing current models' significant limitations and highlighting the need for advanced visual reasoning capabilities.

Contribution

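This paper introduces MM-BRIGHT, which the authors describe as the first multimodal benchmark for reasoning-intensive retrieval. It comprises 2,803 real-world queries across 29 technical domains and defines four retrieval tasks of increasing complexity: text-to-text, multimodal-to-text, multimodal-to-image, and multimodal-to-multimodal.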

Findings

01. State-of-the-art models perform poorly on MM-BRIGHT tasks.

02. BM25 achieves only 8.5 nDCG@10 on text-only retrieval.

03. Multimodal models like Nomic-Vision underperform compared to text-only models.

Abstract

Existing retrieval benchmarks primarily consist of text-based queries where keyword or semantic matching is usually sufficient. Many real-world queries contain multimodal elements, particularly images such as diagrams, charts, and screenshots, that require intensive reasoning to identify relevant documents. To address this gap, we introduce MM-BRIGHT, the first multimodal benchmark for reasoning-intensive retrieval. Our dataset consists of 2,803 real-world queries spanning 29 diverse technical domains, with four tasks of increasing complexity: text-to-text, multimodal-to-text, multimodal-to-image, and multimodal-to-multimodal retrieval. Extensive evaluation reveals that state-of-the-art models struggle across all tasks: BM25 achieves only 8.5 nDCG@10 on text-only retrieval, while the best multimodal model Nomic-Vision reaches just 27.6 nDCG@10 on multimodal-to-text retrieval actually…
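For readers unfamiliar with the metric quoted above, the following is a minimal sketch of how nDCG@10 is commonly computed for a single query. It is not taken from the paper: the function names, the linear-gain form of DCG, and the `qrels` dictionary layout are assumptions made for illustration, and the benchmark's own evaluation tooling may differ in details. Scores such as 8.5 and 27.6 suggest the metric is reported on a 0 to 100 scale, which is also an assumption here.

```python
import math


def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the top-k relevance grades (linear gain)."""
    # Rank positions start at 1, so the discount for index i is log2(i + 2).
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_doc_ids, qrels, k=10):
    """nDCG@k for one query.

    ranked_doc_ids: document ids in the order the retriever returned them.
    qrels: hypothetical mapping of doc id -> relevance grade (0 if absent).
    """
    gains = [qrels.get(doc_id, 0) for doc_id in ranked_doc_ids]
    # The ideal ranking places the highest-graded documents first.
    ideal_gains = sorted(qrels.values(), reverse=True)
    ideal_dcg = dcg_at_k(ideal_gains, k)
    if ideal_dcg == 0:
        return 0.0  # no relevant documents are known for this query
    return dcg_at_k(gains, k) / ideal_dcg


if __name__ == "__main__":
    # Toy example: two relevant documents, retrieved at ranks 2 and 4.
    qrels = {"d1": 1, "d3": 1}
    ranking = ["d2", "d1", "d4", "d3"]
    print(f"nDCG@10 = {100 * ndcg_at_k(ranking, qrels):.1f}")
```

A benchmark-level score is typically the mean of this per-query value over all queries in a task, so a low average indicates that relevant documents rarely appear near the top of the ranking.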


Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Information Retrieval and Search Behavior