WikiLingua: A New Benchmark Dataset for Cross-Lingual Abstractive Summarization

Faisal Ladhak; Esin Durmus; Claire Cardie; Kathleen McKeown

arXiv:2010.03093·cs.CL·October 8, 2020

TL;DR

WikiLingua is a large multilingual dataset of article-summary pairs extracted from WikiHow, built to evaluate cross-lingual abstractive summarization systems. The paper also proposes a direct summarization method that requires no translation at inference time.

Contribution

The paper introduces WikiLingua, a multilingual dataset of article-summary pairs in 18 languages that are aligned across languages, and proposes a direct cross-lingual summarization method that uses synthetic data and Neural Machine Translation as a pre-training step, so no translation is needed at inference time.

Findings

01 Existing methods perform poorly on the new dataset.

02 The proposed method outperforms baseline approaches significantly.

03 The approach reduces inference costs compared to translation-based methods.

Abstract

We introduce WikiLingua, a large-scale, multilingual dataset for the evaluation of crosslingual abstractive summarization systems. We extract article and summary pairs in 18 languages from WikiHow, a high quality, collaborative resource of how-to guides on a diverse set of topics written by human authors. We create gold-standard article-summary alignments across languages by aligning the images that are used to describe each how-to step in an article. As a set of baselines for further studies, we evaluate the performance of existing cross-lingual abstractive summarization methods on our dataset. We further propose a method for direct crosslingual summarization (i.e., without requiring translation at inference time) by leveraging synthetic data and Neural Machine Translation as a pre-training step. Our method significantly outperforms the baseline approaches, while being more cost…
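
The image-based alignment described in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the record layout (`image`, `summary`, `document` keys), the function names, and the toy data are all assumptions made for illustration. The idea shown is only that how-to steps in different languages that use the same image are treated as parallel, which yields pairs of a source-language document and a target-language summary.

```python
from collections import defaultdict


def align_steps_by_image(articles):
    """Group how-to steps from different languages that share an image id.

    `articles` maps a language code to a list of step dicts. The keys
    "image", "summary", and "document" are hypothetical, not the dataset's
    actual schema.
    """
    by_image = defaultdict(dict)
    for lang, steps in articles.items():
        for step in steps:
            by_image[step["image"]][lang] = step
    # Keep only images that occur in more than one language.
    return {img: langs for img, langs in by_image.items() if len(langs) > 1}


def cross_lingual_pairs(aligned, src_lang, tgt_lang):
    """Build (source-language document, target-language summary) pairs."""
    pairs = []
    for img in sorted(aligned):
        langs = aligned[img]
        if src_lang in langs and tgt_lang in langs:
            pairs.append((langs[src_lang]["document"], langs[tgt_lang]["summary"]))
    return pairs


# Toy data, invented for this example.
articles = {
    "en": [
        {"image": "step-001.jpg", "summary": "Boil the water.",
         "document": "Fill a pot with water and put it on the stove."},
        {"image": "step-002.jpg", "summary": "Add the pasta.",
         "document": "Pour the pasta into the pot and stir."},
    ],
    "es": [
        {"image": "step-001.jpg", "summary": "Hierve el agua.",
         "document": "Llena una olla con agua y ponla al fuego."},
    ],
}

aligned = align_steps_by_image(articles)
pairs = cross_lingual_pairs(aligned, src_lang="es", tgt_lang="en")
print(pairs)
```

Only the step whose image appears in both languages survives alignment, so the Spanish document is paired with the English summary of the same step. A real pipeline would also need to handle images reused within a single article, which this sketch ignores.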

Code & Models

Repositories

esdurmus/Wikilingua
Official

Datasets

esdurmus/wiki_lingua
dataset · 1.2k downloads