A Large-Scale Multi-Document Summarization Dataset from the Wikipedia Current Events Portal

Demian Gholipour Ghalandari; Chris Hokamp; Nghia The Pham; John Glover; Georgiana Ifrim

arXiv:2005.10070·cs.CL·May 21, 2020

TL;DR

This paper introduces a large-scale multi-document summarization dataset derived from Wikipedia's Current Events Portal, enabling better training of models for real-world news summarization tasks.

Contribution

The work creates a new, extensive dataset for multi-document summarization using Wikipedia and Common Crawl, facilitating research on large-scale, realistic summarization.

Findings

01 State-of-the-art MDS techniques evaluated on the dataset

02 Dataset covers diverse news events with high-quality summaries

03 Empirical analysis highlights challenges and opportunities in MDS

Abstract

Multi-document summarization (MDS) aims to compress the content in large document collections into short summaries and has important applications in story clustering for newsfeeds, presentation of search results, and timeline generation. However, there is a lack of datasets that realistically address such use cases at a scale large enough for training supervised models for this task. This work presents a new dataset for MDS that is large both in the total number of document clusters and in the size of individual clusters. We build this dataset by leveraging the Wikipedia Current Events Portal (WCEP), which provides concise and neutral human-written summaries of news events, with links to external source articles. We also automatically extend these source articles by looking for related articles in the Common Crawl archive. We provide a quantitative analysis of the dataset and empirical…

Code & Models

Repositories

complementizer/wcep-mds-dataset (Official)
