SurGE: A Benchmark and Evaluation Framework for Scientific Survey Generation

Weihang Su; Anzhe Xie; Qingyao Ai; Jianming Long; Xuanyi Chen; Jiaxin Mao; Ziyi Ye; Yiqun Liu

arXiv:2508.15658 · cs.CL · May 5, 2026

TL;DR

SurGE introduces a comprehensive benchmark and evaluation framework for scientific survey generation, addressing the lack of standardized tools and revealing current limitations of large language models in this task.

Contribution

The paper provides a new benchmark dataset and an automated multi-dimensional evaluation framework, and it open-sources the code and data to advance research in automated scientific survey generation.

Findings

01. Large language models still struggle with the complexity of survey generation.

02. A significant performance gap exists among current LLM-based methods.

03. The benchmark reveals areas for future improvement in survey generation.

Abstract

The rapid growth of academic literature makes the manual creation of scientific surveys increasingly infeasible. While large language models show promise for automating this process, progress in this area is hindered by the absence of standardized benchmarks and evaluation protocols. To bridge this critical gap, we introduce SurGE (Survey Generation Evaluation), a new benchmark for scientific survey generation in computer science. SurGE consists of (1) a collection of test instances, each including a topic description, an expert-written survey, and its full set of cited references, and (2) a large-scale academic corpus of over one million papers. In addition, we propose an automated evaluation framework that measures the quality of generated surveys across four dimensions: comprehensiveness, citation accuracy, structural organization, and content quality. Our evaluation of diverse…
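
To make the shape of a test instance concrete, here is a minimal Python sketch of the three parts the abstract lists (topic description, expert-written survey, cited references), together with a simple reference-recall score. The class name `TestInstance`, its field names, the function `reference_recall`, and the paper IDs are all hypothetical illustrations. The recall score is only one plausible proxy for the comprehensiveness dimension; it is not the metric defined by SurGE, whose details are not given in this summary.

```python
from dataclasses import dataclass, field


@dataclass
class TestInstance:
    """Hypothetical container for the three parts of a test instance."""
    topic: str                                             # topic description
    survey_text: str                                       # expert-written survey
    reference_ids: set[str] = field(default_factory=set)   # cited references


def reference_recall(generated_ids: set[str], instance: TestInstance) -> float:
    """Fraction of the expert survey's references that a generated survey also cites.

    Illustrative proxy only; this is an assumption, not SurGE's actual metric.
    """
    if not instance.reference_ids:
        return 0.0
    hits = generated_ids & instance.reference_ids
    return len(hits) / len(instance.reference_ids)


# Made-up paper IDs, used purely for illustration.
instance = TestInstance(
    topic="Dense retrieval",
    survey_text="placeholder survey text",
    reference_ids={"p1", "p2", "p3", "p4"},
)
score = reference_recall({"p2", "p4", "p9"}, instance)
print(score)  # 2 of the 4 expert references are covered -> 0.5
```

Using sets for reference IDs keeps the overlap computation order-independent and ignores duplicate citations, which seems a reasonable default when comparing citation lists.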

Code & Models

Repositories

oneal2000/SurGE (GitHub)
