# MSRS: Evaluating Multi-Source Retrieval-Augmented Generation

**Authors:** Rohan Phanse, Yijie Zhou, Kejian Shi, Wencai Zhang, Yixin Liu, Yilun Zhao, Arman Cohan

arXiv: 2508.20867 · 2025-08-29

## TL;DR

This paper introduces a scalable framework and two benchmarks, MSRS-Story and MSRS-Meet, for evaluating retrieval-augmented generation (RAG) systems on multi-source, long-form synthesis tasks. Experiments show that generation quality depends strongly on retrieval effectiveness and that reasoning models outperform standard LLMs at the synthesis step.

## Contribution

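The authors present a scalable framework for constructing benchmarks in which no single source is sufficient to answer the query, and use it to build two benchmarks that require retrieval from large collections: MSRS-Story (narrative synthesis) and MSRS-Meet (summarization). They then evaluate RAG pipelines that pair sparse and dense retrievers with frontier LLMs, including an oracle retrieval setting.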

## Key findings

- Generation quality depends heavily on retrieval effectiveness, which varies greatly by task.
- Multi-source synthesis remains challenging even with oracle retrieval.
- Reasoning models significantly outperform standard LLMs at the synthesis step.

## Abstract

Retrieval-augmented systems are typically evaluated in settings where information required to answer the query can be found within a single source or the answer is short-form or factoid-based. However, many real-world applications demand the ability to integrate and summarize information scattered across multiple sources, where no single source is sufficient to respond to the user's question. In such settings, the retrieval component of a RAG pipeline must recognize a variety of relevance signals, and the generation component must connect and synthesize information across multiple sources. We present a scalable framework for constructing evaluation benchmarks that challenge RAG systems to integrate information across distinct sources and generate long-form responses. Using our framework, we build two new benchmarks on Multi-Source Retrieval and Synthesis: MSRS-Story and MSRS-Meet, representing narrative synthesis and summarization tasks, respectively, that require retrieval from large collections. Our extensive experiments with various RAG pipelines -- including sparse and dense retrievers combined with frontier LLMs -- reveal that generation quality is highly dependent on retrieval effectiveness, which varies greatly by task. While multi-source synthesis proves challenging even in an oracle retrieval setting, we find that reasoning models significantly outperform standard LLMs at this distinct step.
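To make the multi-source setting concrete, the following is a minimal sketch of how retrieval coverage might be measured when a query needs several gold documents rather than one. It is not the paper's retriever or metric: the term-overlap scorer is a simple stand-in for a sparse retriever, and the names `overlap_score`, `retrieve`, `multi_source_recall`, and the toy corpus are all hypothetical.

```python
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer (an illustrative simplification)."""
    return text.lower().split()


def overlap_score(query, doc):
    """Count query terms shared with the document (toy sparse-style score)."""
    q = Counter(tokenize(query))
    d = Counter(tokenize(doc))
    return sum(min(q[t], d[t]) for t in q)


def retrieve(query, corpus, k):
    """Return the ids of the k highest-scoring documents, ties broken by id."""
    ranked = sorted(
        corpus,
        key=lambda doc_id: (-overlap_score(query, corpus[doc_id]), doc_id),
    )
    return ranked[:k]


def multi_source_recall(retrieved, gold):
    """Fraction of the required gold documents present in the retrieved set."""
    if not gold:
        raise ValueError("gold set must be non-empty")
    return len(set(retrieved) & set(gold)) / len(gold)


# Hypothetical toy corpus: the query needs d1, d2 and d4 together.
corpus = {
    "d1": "the council approved the budget for the new library",
    "d2": "library construction was delayed by funding disputes",
    "d3": "a recipe for lentil soup with cumin",
    "d4": "the council later revised the library budget after the delay",
}
query = "why was the library budget revised"
gold = {"d1", "d2", "d4"}

top2 = retrieve(query, corpus, k=2)
top3 = retrieve(query, corpus, k=3)
print(top2, multi_source_recall(top2, gold))
print(top3, multi_source_recall(top3, gold))
```

With `k=2` the retriever misses one of the three required documents, so a generator downstream would be synthesizing from incomplete evidence. Passing `gold` directly to the generator corresponds to the oracle retrieval setting described in the abstract, which isolates the synthesis step from retrieval errors.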

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/2508.20867/full.md

## Figures

9 figures with captions in the complete paper: https://tomesphere.com/paper/2508.20867/full.md

## References

71 references — full list in the complete paper: https://tomesphere.com/paper/2508.20867/full.md

---
Source: https://tomesphere.com/paper/2508.20867