TL;DR
SurGE introduces a comprehensive benchmark and evaluation framework for scientific survey generation, addressing the lack of standardized tools and revealing current limitations of large language models in this task.
Contribution
It provides a new benchmark dataset and an automated multi-dimensional evaluation framework, and it open-sources the code and data to advance research in automated scientific survey generation.
Findings
Large language models still struggle with the complexity of survey generation.
A significant performance gap exists among current LLM-based methods.
The benchmark reveals areas for future improvement in survey generation.
Abstract
The rapid growth of academic literature makes the manual creation of scientific surveys increasingly infeasible. While large language models show promise for automating this process, progress in this area is hindered by the absence of standardized benchmarks and evaluation protocols. To bridge this critical gap, we introduce SurGE (Survey Generation Evaluation), a new benchmark for scientific survey generation in computer science. SurGE consists of (1) a collection of test instances, each including a topic description, an expert-written survey, and its full set of cited references, and (2) a large-scale academic corpus of over one million papers. In addition, we propose an automated evaluation framework that measures the quality of generated surveys across four dimensions: comprehensiveness, citation accuracy, structural organization, and content quality. Our evaluation of diverse…
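To make the benchmark's shape concrete, here is a minimal Python sketch of a test instance as the abstract describes it (topic description, expert-written survey, cited references), with a simple set-overlap score between the citations of a generated survey and those of the expert survey. The names `SurveyInstance` and `citation_overlap` are hypothetical, and the precision/recall/F1 computation is a generic stand-in chosen for illustration; the abstract does not specify SurGE's actual metrics, so this should not be read as the paper's evaluation method.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class SurveyInstance:
    """Hypothetical container for the three parts named in the abstract."""
    topic: str                                        # topic description
    reference_survey: str                             # expert-written survey text
    cited_ids: set[str] = field(default_factory=set)  # IDs of the survey's cited references


def citation_overlap(instance: SurveyInstance, generated_ids: set[str]) -> dict[str, float]:
    """Set-overlap precision, recall, and F1 between generated and expert citations.

    Assumption: a generic overlap measure used for illustration only,
    not the metric defined by SurGE.
    """
    hits = len(instance.cited_ids & generated_ids)
    precision = hits / len(generated_ids) if generated_ids else 0.0
    recall = hits / len(instance.cited_ids) if instance.cited_ids else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Illustrative data only; these paper IDs are made up for the example.
example = SurveyInstance(
    topic="Dense retrieval",
    reference_survey="(expert-written survey text)",
    cited_ids={"p1", "p2", "p3", "p4"},
)
scores = citation_overlap(example, {"p1", "p2", "x9"})
```

Here two of the three generated citations appear in the expert reference list, so precision is 2/3 and recall is 2/4. A score of this kind would only touch on citation coverage; the structural and content-quality dimensions mentioned in the abstract would need separate measures.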
