WikiHow: A Large Scale Text Summarization Dataset

Mahnaz Koupaee; William Yang Wang

arXiv:1810.09305 · cs.CL · October 23, 2018 · 179 citations

TL;DR

WikiHow introduces a large, diverse, high-quality dataset of over 230,000 article-summary pairs from an online knowledge base, aimed at advancing abstractive summarization research.

Contribution

The paper presents WikiHow, a novel large-scale dataset with diverse writing styles, enabling more realistic and challenging training for text summarization models.

Findings

01 Existing models perform variably on WikiHow, highlighting its complexity.

02 The dataset provides a new benchmark for abstractive summarization.

03 Baseline results establish initial performance metrics.

Abstract

Sequence-to-sequence models have recently achieved state-of-the-art performance in summarization. However, few large-scale, high-quality datasets are available, and almost all of them consist of news articles written in a specific style. Moreover, abstractive, human-style systems that describe content at a deeper level require data with higher levels of abstraction. In this paper, we present WikiHow, a dataset of more than 230,000 article and summary pairs extracted and constructed from an online knowledge base written by different human authors. The articles span a wide range of topics and therefore represent a high diversity of styles. We evaluate the performance of existing methods on WikiHow to present its challenges and set some baselines for future work to improve on.

Taxonomy

Topics: Topic Modeling · Advanced Text Analysis Techniques · Natural Language Processing Techniques