StratRAG: A Multi-Hop Retrieval Evaluation Dataset for Retrieval-Augmented Generation Systems
Aryan Patodiya

TL;DR
StratRAG is an open-source dataset for evaluating multi-hop retrieval in RAG systems under realistic, noisy document-pool conditions. In its benchmarks, hybrid retrieval performs best overall (Recall@2 = 0.70, MRR = 0.93).
Contribution
The paper introduces StratRAG, a multi-hop retrieval dataset derived from HotpotQA (distractor setting), and reports benchmark results for three retrieval strategies: BM25, dense retrieval (all-MiniLM-L6-v2), and hybrid fusion.
Findings
Hybrid retrieval achieves the best overall performance of the three strategies on the validation set (Recall@2 = 0.70, MRR = 0.93).
Bridge questions remain substantially harder for retrieval (Recall@2 = 0.67).
StratRAG is publicly available for further research.
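The summary does not say how the hybrid strategy combines BM25 and dense rankings. As an illustration only, the sketch below uses reciprocal rank fusion (RRF), a common fusion method that may differ from the one used in the paper. The function name, the constant k = 60, and the toy document ids are all assumptions made for this example.

```python
# Hypothetical sketch: reciprocal rank fusion (RRF) of two rankings.
# The paper's actual fusion method is not specified in this summary;
# RRF is swapped in here purely as a familiar example.

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one ranking.

    Each document receives 1 / (k + rank) from every list it appears in
    (ranks start at 1); documents are returned by descending total score,
    with ties broken by document id for determinism.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda d: (-scores[d], d))


# Toy rankings (made-up ids, not drawn from StratRAG).
bm25_ranking = ["a", "b", "c", "d"]
dense_ranking = ["c", "a", "d", "b"]
fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
print(fused)
```

RRF uses only rank positions, so it avoids having to normalize BM25 scores and cosine similarities onto a common scale. A weighted sum of normalized scores would be an equally plausible reading of "hybrid fusion".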
Abstract
We introduce StratRAG, an open-source retrieval evaluation dataset for benchmarking Retrieval-Augmented Generation (RAG) systems on multi-hop reasoning tasks under realistic, noisy document-pool conditions. Derived from HotpotQA (distractor setting), StratRAG comprises 2,200 examples across three question types -- bridge, comparison, and yes-no -- each paired with a pool of 15 candidate documents containing exactly 2 gold documents and 13 topically related distractors. We benchmark three retrieval strategies -- BM25, dense retrieval (all-MiniLM-L6-v2), and hybrid fusion -- reporting Recall@k, MRR, and NDCG@5 on the validation set. Hybrid retrieval achieves the best overall performance (Recall@2 = 0.70, MRR = 0.93), yet bridge questions remain substantially harder (Recall@2 = 0.67), motivating future work on reinforcement-learning-based retrieval policies. StratRAG is publicly available…
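The three reported metrics can be sketched for a single example as follows, assuming binary relevance (a document is either gold or not) and the standard textbook definitions. The exact formulas used in the paper are not given in this summary, and the function names and toy ranking below are hypothetical.

```python
import math

# Minimal sketch of per-example retrieval metrics under binary relevance.
# Assumption: standard definitions; the paper's exact variants may differ.

def recall_at_k(ranked, gold, k):
    """Fraction of gold documents found in the top-k of the ranking."""
    return len(set(ranked[:k]) & gold) / len(gold)

def reciprocal_rank(ranked, gold):
    """1 / rank of the first gold document (0.0 if none is retrieved).
    MRR is the mean of this value over all examples."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in gold:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(ranked, gold, k=5):
    """Binary-relevance NDCG@k: DCG of the ranking divided by ideal DCG."""
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, doc_id in enumerate(ranked[:k], start=1)
              if doc_id in gold)
    ideal_hits = min(len(gold), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0


# Toy example shaped like a StratRAG pool: 15 candidates, exactly 2 gold.
# The ids and the ranking are made up for illustration.
gold = {"d3", "d7"}
others = [f"d{i}" for i in range(15) if i not in (1, 3, 7)]
ranked = ["d3", "d1", "d7"] + others

print(recall_at_k(ranked, gold, 2))   # one of the two gold docs is in the top 2
print(reciprocal_rank(ranked, gold))  # first gold doc is at rank 1
print(ndcg_at_k(ranked, gold, 5))
```

With exactly two gold documents per pool, Recall@2 for a single example can only be 0, 0.5, or 1, so a score of 1 requires both gold documents in the top two positions.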
