Evaluation of retrieval-based QA on QUEST-LOFT
Nathan Scales, Nathanael Schärli, Olivier Bousquet

TL;DR
This paper analyzes the limitations of retrieval-augmented generation for complex, multi-document questions, and demonstrates how structured output formats and re-verification can significantly improve performance on the QUEST-LOFT benchmark.
Contribution
It provides an in-depth analysis of RAG's challenges on QUEST-LOFT and proposes optimization strategies involving structured reasoning and evidence verification.
Findings
RAG performance improves with structured output formats.
A thorough human evaluation yields updated, more accurate performance numbers.
Optimized RAG outperforms long-context models on QUEST-LOFT.
Abstract
Despite the popularity of retrieval-augmented generation (RAG) as a solution for grounded QA in both academia and industry, current RAG methods struggle with questions where the necessary information is distributed across many documents or where retrieval needs to be combined with complex reasoning. Recently, the LOFT study has shown that this limitation also applies to approaches based on long-context language models, with the QUEST benchmark exhibiting particularly large headroom. In this paper, we provide an in-depth analysis of the factors contributing to the poor performance on QUEST-LOFT, publish updated numbers based on a thorough human evaluation, and demonstrate that RAG can be optimized to significantly outperform long-context approaches when combined with a structured output format containing reasoning and evidence, optionally followed by answer re-verification.
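To make the idea of a structured output with reasoning and evidence, followed by re-verification, more concrete, here is a minimal Python sketch. It is not the paper's actual format or prompt: the `CandidateAnswer` schema, the `reverify` function, and the toy `mention_verifier` are all hypothetical names introduced for illustration. In practice the verifier would presumably be a second model call rather than a string check.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class CandidateAnswer:
    """One candidate answer with its justification (hypothetical schema)."""
    answer: str
    reasoning: str
    evidence: List[str]  # passages quoted from the retrieved documents


def reverify(
    question: str,
    candidates: List[CandidateAnswer],
    verifier: Callable[[str, CandidateAnswer], bool],
) -> List[str]:
    """Keep only candidates that cite evidence and pass a second check."""
    kept = []
    for cand in candidates:
        if not cand.evidence:
            # An answer without supporting evidence is dropped outright.
            continue
        if verifier(question, cand):
            kept.append(cand.answer)
    return kept


def mention_verifier(question: str, cand: CandidateAnswer) -> bool:
    """Toy stand-in for a model-based verifier (assumption, not the paper's method).

    Accepts a candidate only if its answer string appears in at least one
    of its own evidence passages. The question is ignored here.
    """
    return any(cand.answer.lower() in passage.lower() for passage in cand.evidence)


candidates = [
    CandidateAnswer(
        answer="Dune",
        reasoning="The cited passage describes it as a science fiction novel.",
        evidence=["Dune is a 1965 science fiction novel."],
    ),
    CandidateAnswer(answer="Emma", reasoning="No supporting passage found.", evidence=[]),
    CandidateAnswer(
        answer="Ulysses",
        reasoning="Possibly relevant.",
        evidence=["A modernist novel set in Dublin."],
    ),
]

verified = reverify("Which of these are science fiction novels?", candidates, mention_verifier)
```

The design point is that each answer carries its own reasoning and evidence, so a later pass can accept or reject answers one at a time instead of re-answering the whole question.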
Taxonomy
Topics: Information Retrieval and Search Behavior · Topic Modeling · Biomedical Text Mining and Ontologies
