# DeepScholar-Bench: A Live Benchmark and Automated Evaluation for Generative Research Synthesis

**Authors:** Liana Patel, Negar Arabzadeh, Harshit Gupta, Ankita Sundar, Ion Stoica, Matei Zaharia, Carlos Guestrin

arXiv: 2508.20033 · 2026-02-10

## TL;DR

DeepScholar-bench is a live benchmark and automated evaluation framework that assesses how well AI systems perform generative research synthesis. Its task is to generate the related work section of a recent arXiv paper by retrieving, synthesizing, and citing prior work.

## Contribution

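The paper introduces DeepScholar-bench, a live benchmark whose automated evaluation measures knowledge synthesis, retrieval quality, and verifiability. It also contributes DeepScholar-ref, an open-source reference pipeline built on the LOTUS framework that serves as a strong baseline.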

## Key findings

- No evaluated system surpasses a geometric mean of 31% across all metrics
- DeepScholar-bench is far from saturated, which shows how difficult automated research synthesis remains
- The benchmark is offered as a foundation for advancing AI systems capable of generative research synthesis
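
The abstract reports the headline result as a geometric mean across all metrics. The sketch below illustrates how such an aggregate behaves. It is a minimal illustration, not the benchmark's scoring code: the function name `geometric_mean`, the grouping into three scores, and the numeric values are all assumptions made up for this example.

```python
import math


def geometric_mean(scores):
    """Return the geometric mean of non-negative scores.

    A single zero score drives the aggregate to zero, which is why a
    geometric mean penalizes a system that fails on any one metric.
    """
    if not scores:
        raise ValueError("need at least one score")
    if any(s < 0 for s in scores):
        raise ValueError("scores must be non-negative")
    if any(s == 0 for s in scores):
        return 0.0
    # Average in log space, then exponentiate back.
    return math.exp(sum(math.log(s) for s in scores) / len(scores))


# Hypothetical scores for one system (made-up values, NOT results from the paper).
# The keys mirror the three dimensions named in the abstract; the real benchmark
# may aggregate over a different set of individual metrics.
example_scores = {
    "knowledge_synthesis": 0.50,
    "retrieval_quality": 0.20,
    "verifiability": 0.30,
}

overall = geometric_mean(list(example_scores.values()))
arithmetic = sum(example_scores.values()) / len(example_scores)

print(f"geometric mean:  {overall:.3f}")
print(f"arithmetic mean: {arithmetic:.3f}")
```

For unequal scores the geometric mean is always lower than the arithmetic mean, so a system cannot compensate for one weak dimension with a strong score elsewhere.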

## Abstract

The ability to research and synthesize knowledge is central to human expertise and progress. A new class of AI systems--designed for generative research synthesis--aims to automate this process by retrieving information from the live web and producing long-form, cited reports. Yet, evaluating such systems remains an open challenge: existing question-answering benchmarks focus on short, factual answers, while expert-curated datasets risk staleness and data contamination. Neither captures the complexity and evolving nature of real research synthesis tasks. We introduce DeepScholar-bench, a live benchmark and automated evaluation framework for generative research synthesis. DeepScholar-bench draws queries and human-written exemplars from recent, high-quality ArXiv papers and evaluates a real synthesis task: generating a related work section by retrieving, synthesizing, and citing prior work. Our automated framework holistically measures performance across three key dimensions--knowledge synthesis, retrieval quality, and verifiability. To further future work, we also contribute DeepScholar-ref, a simple, open-source reference pipeline, which is implemented on the LOTUS framework and provides a strong baseline. Using DeepScholar-bench, we systematically evaluate prior open-source systems, search agents with strong models, OpenAI's DeepResearch, and DeepScholar-ref. We find DeepScholar-bench is far from saturated: no system surpasses a geometric mean of $31\%$ across all metrics. These results highlight both the difficulty and importance of DeepScholar-bench as a foundation for advancing AI systems capable of generative research synthesis. We make our benchmark code and data available at https://github.com/guestrin-lab/deepscholar-bench.

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/2508.20033/full.md

## Figures

30 figures with captions in the complete paper: https://tomesphere.com/paper/2508.20033/full.md

## References

66 references; the full list is in the complete paper: https://tomesphere.com/paper/2508.20033/full.md

---
Source: https://tomesphere.com/paper/2508.20033