OARelatedWork: A Large-Scale Dataset of Related Work Sections with Full-texts from Open Access Sources
Martin Docekal, Martin Fajcik, Pavel Smrz

TL;DR
This paper presents OARelatedWork, a large-scale dataset of whole related work sections together with the full texts of the papers they cite. It aims to move automatic related work generation beyond abstracts, shows that full content raises the estimated upper bound for extractive summarization by 217% in ROUGE-2, and proposes a new evaluation meta-metric for long outputs.
Contribution
The first large-scale dataset with full texts for related work generation, together with an analysis of how full texts benefit various summarization baselines.
Findings
Using full content instead of abstracts raises the estimated upper bound for extractive summarization by 217% in ROUGE-2.
Full texts enhance performance of multiple summarization baselines.
Proposed meta-metric correlates well with human judgment for long outputs.
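To make the first finding concrete, here is a minimal sketch of how an extractive upper bound can be estimated and compared across input types. It is an illustration under stated assumptions, not the authors' procedure: the whitespace tokenizer, the recall-only ROUGE-2 variant, the greedy selection strategy, and the names rouge2_recall, greedy_oracle, and relative_increase_percent are all hypothetical choices made for this example.

```python
from collections import Counter


def bigrams(text):
    """Multiset of lowercase word bigrams (naive whitespace tokenization)."""
    tokens = text.lower().split()
    return Counter(zip(tokens, tokens[1:]))


def rouge2_recall(candidate, reference):
    """Clipped bigram overlap divided by the number of reference bigrams."""
    ref = bigrams(reference)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    overlap = sum((bigrams(candidate) & ref).values())
    return overlap / total


def greedy_oracle(sentences, reference, max_sentences=3):
    """Greedily add the sentence that most improves ROUGE-2 recall.

    Assumption: a simple greedy oracle; the paper may estimate its
    upper bound differently. Joining sentences with a space also
    creates cross-sentence bigrams, which is a simplification.
    """
    chosen, best = [], 0.0
    for _ in range(max_sentences):
        gains = [(rouge2_recall(" ".join(chosen + [s]), reference), s)
                 for s in sentences if s not in chosen]
        if not gains:
            break
        score, sentence = max(gains)
        if score <= best:  # stop when no sentence improves the score
            break
        chosen.append(sentence)
        best = score
    return chosen, best


def relative_increase_percent(old, new):
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100.0


# Toy data, invented for illustration only.
reference = "full texts improve related work generation"
abstract_sentences = ["related work generation is hard"]
full_text_sentences = abstract_sentences + ["full texts improve results"]

_, abstract_bound = greedy_oracle(abstract_sentences, reference)
_, full_bound = greedy_oracle(full_text_sentences, reference)
gain = relative_increase_percent(abstract_bound, full_bound)
```

Read this way, a 217% increase means the full-content upper bound is about 3.17 times the abstract-only one, since (3.17 - 1) / 1 * 100 = 217.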
Abstract
This paper introduces OARelatedWork, the first large-scale multi-document summarization dataset for related work generation containing whole related work sections and full-texts of cited papers. The dataset includes 94 450 papers and 5 824 689 unique referenced papers. It was designed for the task of automatically generating related work to shift the field toward generating entire related work sections from all available content instead of generating parts of related work sections from abstracts only, which is the current mainstream in this field for abstractive approaches. We show that the estimated upper bound for extractive summarization increases by 217% in the ROUGE-2 score, when using full content instead of abstracts. Furthermore, we show the benefits of full content data on naive, oracle, traditional, and transformer-based baselines. Long outputs, such as related work sections,…
Taxonomy
Topics: Online Learning and Analytics
