Frustratingly Simple Retrieval Improves Challenging, Reasoning-Intensive Benchmarks
Xinxi Lyu, Michael Duan, Rulin Shao, Pang Wei Koh, Sewon Min

TL;DR
This paper demonstrates that a simple retrieval-augmented generation approach using a carefully curated, compact, web-scale datastore significantly improves performance on challenging reasoning benchmarks, surpassing complex systems.
Contribution
Introduction of CompactDS, a high-quality, web-scale datastore that enhances retrieval accuracy and efficiency, enabling minimal RAG to achieve substantial gains on reasoning-intensive benchmarks.
Findings
10-33% accuracy improvements across benchmarks
CompactDS matches or outperforms Google Search
Simple RAG pipeline surpasses complex agent-based systems
Abstract
Retrieval-augmented Generation (RAG) has primarily been studied in limited settings, such as factoid question answering; more challenging, reasoning-intensive benchmarks have seen limited success from minimal RAG. In this work, we challenge this prevailing view on established, reasoning-intensive benchmarks: MMLU, MMLU Pro, AGI Eval, GPQA, and MATH. We identify a key missing component in prior work: a usable, web-scale datastore aligned with the breadth of pretraining data. To this end, we introduce CompactDS: a diverse, high-quality, web-scale datastore that achieves high retrieval accuracy and subsecond latency on a single node. The key insights are (1) most web content can be filtered out without sacrificing coverage, and a compact, high-quality subset is sufficient; and (2) combining in-memory approximate nearest neighbor (ANN) retrieval and on-disk exact search balances speed and…
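The second insight in the abstract (a compact in-memory approximate stage followed by exact search over vectors kept on disk) can be illustrated with a small sketch. This is not the authors' implementation: the random-projection index below is only a stand-in for a real ANN index, and the names `build_datastore`, `search`, `proj_dim`, and `n_candidates` are hypothetical, chosen for illustration.

```python
import numpy as np


def build_datastore(vectors, path, proj_dim=16, seed=0):
    """Store full-precision vectors on disk; keep only a small projected copy in RAM.

    Assumption: a random projection stands in for a real ANN index.
    """
    vectors = np.asarray(vectors, dtype=np.float32)
    n, dim = vectors.shape
    # Full-precision vectors live in a flat file on disk.
    on_disk = np.memmap(path, dtype=np.float32, mode="w+", shape=(n, dim))
    on_disk[:] = vectors
    on_disk.flush()
    del on_disk
    # Compact in-memory representation: proj_dim floats per vector instead of dim.
    rng = np.random.default_rng(seed)
    projection = rng.standard_normal((dim, proj_dim)).astype(np.float32)
    projection /= np.sqrt(proj_dim)
    return {
        "path": path,
        "shape": (n, dim),
        "projection": projection,
        "compact": vectors @ projection,
    }


def search(store, query, k=5, n_candidates=50):
    """Two-stage search: approximate shortlist in RAM, exact rerank from disk."""
    query = np.asarray(query, dtype=np.float32)
    # Stage 1 (in memory, approximate): score every item in the projected space.
    approx_scores = store["compact"] @ (query @ store["projection"])
    n_candidates = min(n_candidates, approx_scores.shape[0])
    candidates = np.argpartition(-approx_scores, n_candidates - 1)[:n_candidates]
    # Stage 2 (on disk, exact): read only the shortlisted rows and rescore them
    # with the true inner product.
    on_disk = np.memmap(store["path"], dtype=np.float32, mode="r", shape=store["shape"])
    candidates = np.sort(candidates)  # sorted row ids give more sequential disk reads
    exact_scores = np.asarray(on_disk[candidates]) @ query
    order = np.argsort(-exact_scores)[:k]
    return candidates[order], exact_scores[order]
```

In this sketch, `n_candidates` controls the trade-off: a larger shortlist means more disk reads per query but a smaller chance that the approximate stage drops a true top-`k` neighbor. Setting it to the datastore size recovers exact brute-force search.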
Peer Reviews
Decision: ICLR 2026 Poster
1. The paper releases a datastore for retrieval.
2. The paper includes a method for efficiently retrieving passages that reduces the RAM needed.
3. The experiments are conducted on different model sizes.
## Improvements on reasoning-intensive benchmarks

I am not sure the experiments fully support the claim that the proposed RAG system improves **reasoning-intensive** tasks.

1. MMLU and MMLU Pro are more knowledge-related: these two benchmarks mostly test the model's knowledge. I am not familiar with AGI Eval, but from Table 10 it seems to be multiple-choice (MC), and no CoT is allowed when the model answers. While the model still needs to reason internally even without CoT, I think "re
- Well written and motivates why current RAG pipelines are insufficient due to lack of scope and high retrieval latency
- Demonstrate improvement on a variety of benchmarks using RAG
- Considers problems in actual deployment (i.e., on-device fine-grained search after coarse-grained candidate search to minimize latency)
- Upper-bounding the RAG performance with the oracle baseline makes sense and demonstrates how close this “autonomous” RAG system is to a curated ICL baseline
- The paper introduces CompactDS as a means to improve reasoning performance using RAG but focuses mainly on multiple-choice evals that aren’t particularly reasoning-intensive. The MATH results are particularly interesting, but I would have liked to see the upper-bound performance on this task (few-shot prompting/ICL). My main concern is that the authors claim RAG helps with reasoning performance, but it seems hard to separate the boost from fact retrieval from the boost in reasonin
The paper is clearly written and presents strong empirical evidence that retrieval remains useful for reasoning tasks when the datastore is comprehensive and high quality. The motivation is well grounded in an important and practical question. The retrieval design is simple yet effective, showing that careful data construction and indexing choices can yield significant improvements over prior baselines. Experiments are systematic, covering multiple model families and evaluation settings, and the
The main limitation of the paper lies in its limited conceptual novelty. The contribution is primarily an engineering improvement through large-scale data construction rather than a new retrieval or reasoning algorithm. Although the results are strong, it remains unclear how well COMPACTDS generalizes beyond the specific reasoning benchmarks used in evaluation. Since the datastore composition is closely aligned with those benchmarks, the observed gains might partly reflect domain overlap rather
Taxonomy
Topics: Topic Modeling · Information Retrieval and Search Behavior · Multimodal Machine Learning Applications
