WikiLingua: A New Benchmark Dataset for Cross-Lingual Abstractive Summarization
Faisal Ladhak, Esin Durmus, Claire Cardie, Kathleen McKeown

TL;DR
WikiLingua is a large-scale multilingual dataset, derived from WikiHow, for evaluating cross-lingual abstractive summarization systems. Its article-summary pairs are aligned across languages through the images shared by each how-to step, and the paper also proposes a direct cross-lingual summarization method.
Contribution
The paper introduces WikiLingua, a new multilingual dataset with aligned article-summary pairs across 18 languages, and proposes a cost-efficient method for direct cross-lingual summarization. The method needs no translation at inference time; it relies on synthetic data and on Neural Machine Translation as a pre-training step.
Findings
Existing cross-lingual abstractive summarization methods are evaluated on the new dataset as baselines for further studies.
The proposed direct method significantly outperforms the baseline approaches.
Because it requires no translation at inference time, the approach is more cost-efficient than translation-based methods.
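The cost argument can be made concrete with a toy sketch that counts model invocations per article. Everything here is an assumption for illustration: `CountingModel` is a stand-in for a neural model, and the summarize-then-translate ordering is just one possible translation-based pipeline. The number of calls is only a rough proxy for inference cost, and the sketch does not reproduce any measurement from the paper.

```python
class CountingModel:
    """Hypothetical stand-in for a neural model; records how often it runs."""

    def __init__(self, name):
        self.name = name
        self.calls = 0

    def __call__(self, text):
        self.calls += 1
        # Return a tagged string so the data flow stays visible.
        return f"{self.name}({text})"


def pipeline_summarize(article, summarizer, translator):
    """Translation-based baseline (assumed ordering): summarize, then translate."""
    return translator(summarizer(article))


def direct_summarize(article, cross_lingual_model):
    """Direct approach: one model maps the source article to a target summary."""
    return cross_lingual_model(article)


articles = ["article_1", "article_2", "article_3"]

summarizer = CountingModel("summarize")
translator = CountingModel("translate")
direct_model = CountingModel("direct")

pipeline_outputs = [pipeline_summarize(a, summarizer, translator) for a in articles]
direct_outputs = [direct_summarize(a, direct_model) for a in articles]

pipeline_calls = summarizer.calls + translator.calls
direct_calls = direct_model.calls
print("pipeline model calls:", pipeline_calls)
print("direct model calls:", direct_calls)
```

In this sketch the pipeline runs two models per article while the direct approach runs one, which is the sense in which dropping the translation step at inference time lowers cost.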
Abstract
We introduce WikiLingua, a large-scale, multilingual dataset for the evaluation of cross-lingual abstractive summarization systems. We extract article and summary pairs in 18 languages from WikiHow, a high-quality, collaborative resource of how-to guides on a diverse set of topics written by human authors. We create gold-standard article-summary alignments across languages by aligning the images that are used to describe each how-to step in an article. As a set of baselines for further studies, we evaluate the performance of existing cross-lingual abstractive summarization methods on our dataset. We further propose a method for direct cross-lingual summarization (i.e., without requiring translation at inference time) by leveraging synthetic data and Neural Machine Translation as a pre-training step. Our method significantly outperforms the baseline approaches, while being more cost…
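The image-based alignment described in the abstract can be illustrated with a small sketch. It assumes each how-to step is stored as a record holding an image identifier, a one-line summary, and the detailed step text. The `Step` layout, the field names, the sample data, and the `align_steps` helper are all hypothetical and do not reflect the format of the released dataset.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    image_id: str  # hypothetical identifier of the step's illustration
    summary: str   # one-line summary of the step
    text: str      # detailed description of the step


def align_steps(source_steps, target_steps):
    """Pair steps from two language versions that use the same image."""
    target_by_image = {step.image_id: step for step in target_steps}
    return [
        (step, target_by_image[step.image_id])
        for step in source_steps
        if step.image_id in target_by_image
    ]


# Made-up sample data for two language versions of one guide.
english = [
    Step("img_001", "Boil the water.", "Fill a pot and bring it to a boil."),
    Step("img_002", "Add the pasta.", "Pour the pasta into the pot and stir."),
]
spanish = [
    Step("img_002", "Añade la pasta.", "Vierte la pasta en la olla y remueve."),
    Step("img_003", "Escurre la pasta.", "Cuela la pasta con un colador."),
]

pairs = align_steps(english, spanish)
for src, tgt in pairs:
    print(src.image_id, "|", src.summary, "|", tgt.summary)
```

Under these assumptions, a cross-lingual training example would pair the source-language step text (`src.text`) with the target-language summary (`tgt.summary`); steps whose image appears in only one language version are left unaligned.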
