The State and Fate of Summarization Datasets: A Survey

Noam Dahan; Gabriel Stanovsky

arXiv:2411.04585·cs.CL·February 12, 2025

TL;DR

This survey analyzes 133 summarization datasets in over 100 languages. It highlights the scarcity of accessible, high-quality datasets for low-resource languages and the field's over-reliance on news data, and it provides tools to improve dataset discovery and research coherence.

Contribution

The survey introduces an ontology for summarization datasets that covers sample properties, collection methods, and distribution. It uses this ontology to analyze dataset properties, and it offers tools that support dataset discovery and research standardization.

Findings

01. Accessible, high-quality datasets are scarce for low-resource languages.

02. The field over-relies on the news domain and on automatically collected distant supervision.

03. A web interface and a data card template are provided for dataset exploration.

Abstract

Automatic summarization has consistently attracted attention due to its versatility and wide application in various downstream tasks. Despite its popularity, we find that annotation efforts have largely been disjointed and have lacked common terminology. Consequently, it is challenging to discover existing resources or identify coherent research directions. To address this, we survey a large body of work spanning 133 datasets in over 100 languages, creating a novel ontology covering sample properties, collection methods and distribution. With this ontology we make key observations, including the lack of accessible high-quality datasets for low-resource languages, and the field's over-reliance on the news domain and on automatically collected distant supervision. Finally, we make available a web interface that allows users to interact with and explore our ontology and dataset collection, as…
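
To make the idea of an ontology over sample properties, collection methods and distribution more concrete, here is a minimal Python sketch of how such records could be stored and queried. It is not the authors' schema or tooling: the `DatasetCard` class, its field names, the helper functions, and the three placeholder entries are all hypothetical and chosen only for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class DatasetCard:
    """Hypothetical record for one summarization dataset (illustrative fields only)."""
    name: str
    languages: list[str]      # language codes covered by the dataset
    domain: str               # a sample property, e.g. "news"
    collection_method: str    # e.g. "distant supervision" or "manual annotation"
    publicly_available: bool  # a distribution property


def filter_cards(cards, *, language=None, domain=None, collection_method=None):
    """Return the cards that match every criterion that was supplied."""
    result = []
    for card in cards:
        if language is not None and language not in card.languages:
            continue
        if domain is not None and card.domain != domain:
            continue
        if collection_method is not None and card.collection_method != collection_method:
            continue
        result.append(card)
    return result


def domain_share(cards, domain):
    """Fraction of cards whose domain equals the given one (0.0 for an empty list)."""
    if not cards:
        return 0.0
    return sum(card.domain == domain for card in cards) / len(cards)


# Placeholder entries, not real datasets from the survey.
CARDS = [
    DatasetCard("example-news-en", ["en"], "news", "distant supervision", True),
    DatasetCard("example-news-multi", ["en", "sw"], "news", "distant supervision", True),
    DatasetCard("example-dialogue-en", ["en"], "dialogue", "manual annotation", False),
]
```

With records in this shape, a question such as "which datasets cover a given language?" becomes `filter_cards(CARDS, language="sw")`, and a skew toward one domain can be measured with `domain_share(CARDS, "news")`.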

Code & Models

Repositories

edahanoam/Awesome-Summarization-Datasets (Official)

Videos

The State and Fate of Summarization Datasets: A Survey · Underline

Taxonomy

Topics: Topic Modeling

Methods: Ontology