WikiHow: A Large Scale Text Summarization Dataset
Mahnaz Koupaee, William Yang Wang

TL;DR
WikiHow introduces a large, diverse, high-quality dataset of over 230,000 article-summary pairs from an online knowledge base, aimed at advancing abstractive summarization research.
Contribution
The paper presents WikiHow, a novel large-scale dataset with diverse writing styles, enabling more realistic and challenging training for text summarization models.
Findings
Existing models perform variably on WikiHow, highlighting its complexity.
The dataset provides a new benchmark for abstractive summarization.
Baseline results establish initial performance metrics.
Abstract
Sequence-to-sequence models have recently achieved state-of-the-art performance in summarization. However, few large-scale, high-quality datasets are available, and almost all of them consist mainly of news articles written in a specific style. Moreover, abstractive, human-style systems that describe content at a deeper level require data with higher levels of abstraction. In this paper, we present WikiHow, a dataset of more than 230,000 article and summary pairs extracted and constructed from an online knowledge base written by different human authors. The articles span a wide range of topics and therefore represent a high diversity of styles. We evaluate the performance of existing methods on WikiHow to present its challenges and set some baselines for further improvement.
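To make the idea of baselines over article and summary pairs concrete, here is a minimal sketch in Python. It is not the authors' evaluation setup: the toy pair, the `lead_n` baseline, and the `unigram_f1` score (a simple unigram-overlap F1 in the spirit of ROUGE-1) are all illustrative assumptions, since this page does not state which methods or metrics the paper uses.

```python
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately crude tokenizer)."""
    return text.lower().split()


def unigram_f1(candidate, reference):
    """Unigram-overlap F1 between a candidate and a reference summary.

    Assumption: a stand-in for a ROUGE-1-style score, not the paper's metric.
    """
    cand = Counter(tokenize(candidate))
    ref = Counter(tokenize(reference))
    overlap = sum((cand & ref).values())  # clipped count of shared unigrams
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def lead_n(article, n=1):
    """Hypothetical extractive baseline: keep the first n sentences."""
    sentences = [s.strip() for s in article.split(".") if s.strip()]
    if not sentences:
        return ""
    return ". ".join(sentences[:n]) + "."


# Toy pair invented for illustration; it is not taken from the dataset.
pairs = [
    {
        "article": "Fill the kettle with water. Boil the water for three "
                   "minutes. Pour it over the tea bag.",
        "summary": "Boil water and pour it over the tea bag.",
    },
]

for pair in pairs:
    prediction = lead_n(pair["article"], n=1)
    score = unigram_f1(prediction, pair["summary"])
    print(f"prediction={prediction!r} unigram_f1={score:.3f}")
```

A lead-sentence baseline like this tends to be strong on news data, where the opening sentences often carry the gist. Comparing it against learned models is one common way to probe how a dataset with more varied writing styles differs.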
Code & Models
- 🤗 sentence-transformers/all-MiniLM-L6-v2 (model) · 200.9M downloads · ♡ 4639
- 🤗 sentence-transformers/all-mpnet-base-v2 (model) · 28.7M downloads · ♡ 1266
- 🤗 Hum-Works/lodestone-base-4096-v1 (model) · 112 downloads · ♡ 12
- 🤗 arredondos/my_sentence_transformer (model) · 1 download
- 🤗 flax-sentence-embeddings/all_datasets_v3_MiniLM-L12 (model) · 5 downloads · ♡ 2
- 🤗 flax-sentence-embeddings/all_datasets_v3_MiniLM-L6 (model) · 3 downloads
- 🤗 flax-sentence-embeddings/all_datasets_v3_distilroberta-base (model) · 1 download · ♡ 2
- 🤗 flax-sentence-embeddings/all_datasets_v3_mpnet-base (model) · 596 downloads · ♡ 13
- 🤗 flax-sentence-embeddings/all_datasets_v3_roberta-large (model) · 24 downloads · ♡ 13
- 🤗 flax-sentence-embeddings/all_datasets_v4_MiniLM-L12 (model) · 2 downloads · ♡ 2
Taxonomy
Topics: Topic Modeling · Advanced Text Analysis Techniques · Natural Language Processing Techniques
