BRIGHT: A Realistic and Challenging Benchmark for Reasoning-Intensive Retrieval

Hongjin Su; Howard Yen; Mengzhou Xia; Weijia Shi; Niklas Muennighoff; Han-yu Wang; Haisu Liu; Quan Shi; Zachary S. Siegel; Michael Tang; Ruoxi Sun; Jinsung Yoon; Sercan O. Arik; Danqi Chen; Tao Yu

arXiv:2407.12883 · cs.CL · March 27, 2025 · 3 cites

Open Access · 1 Repo · 1 Model · 4 Datasets

TL;DR

BRIGHT is a new benchmark designed to evaluate retrieval models on complex, reasoning-intensive queries across diverse domains, revealing significant performance gaps and highlighting the importance of explicit reasoning and retrieval-augmented methods.

Contribution

This paper introduces BRIGHT, the first benchmark focusing on reasoning-intensive retrieval tasks with real-world queries, and demonstrates the limitations of current models while proposing reasoning-enhanced retrieval strategies.

Findings

01 State-of-the-art models perform poorly on BRIGHT

02 Explicit reasoning improves retrieval accuracy by up to 12.2 points

03 Retrieval-augmented methods enhance question-answering performance

Abstract

Existing retrieval benchmarks primarily consist of information-seeking queries (e.g., aggregated questions from search engines) where keyword or semantic-based retrieval is usually sufficient. However, many complex real-world queries require in-depth reasoning to identify relevant documents that go beyond surface form matching. For example, finding documentation for a coding question requires understanding the logic and syntax of the functions involved. To better benchmark retrieval on such challenging queries, we introduce BRIGHT, the first text retrieval benchmark that requires intensive reasoning to retrieve relevant documents. Our dataset consists of 1,384 real-world queries spanning diverse domains, such as economics, psychology, mathematics, and coding. These queries are drawn from naturally occurring and carefully curated human data. Extensive evaluation reveals that even…

Code & Models

Repositories

castorini/Anserini

Models

🤗 BAAI/bge-reasoner-embed-qwen3-8b-0923
model · 711 dl · ♡ 25

Taxonomy

Topics: AI-based Problem Solving and Planning · Information Retrieval and Search Behavior · Semantic Web and Ontologies