Models and Datasets for Cross-Lingual Summarisation
Laura Perez-Beltrachini, Mirella Lapata

TL;DR
This paper introduces a cross-lingual summarisation dataset derived from Wikipedia, covering twelve language pairs and directions across Czech, English, French and German, and evaluates multilingual pre-trained models in supervised, zero- and few-shot, and out-of-domain scenarios.
Contribution
The paper contributes a cross-lingual summarisation dataset, a creation methodology that can be applied to other languages, and an analysis of the task with automatic metrics and a human study, together with experiments using multilingual pre-trained models.
Findings
Multilingual pre-trained models can be trained and evaluated for cross-lingual summarisation on the corpus.
The dataset supports evaluation in supervised, zero- and few-shot, and out-of-domain settings.
A human study validates the proposed task, complementing the analysis with automatic metrics.
Abstract
We present a cross-lingual summarisation corpus with long documents in a source language associated with multi-sentence summaries in a target language. The corpus covers twelve language pairs and directions for four European languages, namely Czech, English, French and German, and the methodology for its creation can be applied to several other languages. We derive cross-lingual document-summary instances from Wikipedia by combining lead paragraphs and article bodies from language-aligned Wikipedia titles. We analyse the proposed cross-lingual summarisation task with automatic metrics and validate it with a human study. To illustrate the utility of our dataset, we report experiments with multilingual pre-trained models in supervised, zero- and few-shot, and out-of-domain scenarios.
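The construction described in the abstract can be illustrated with a minimal sketch. It assumes that, for each set of language-aligned titles, the article body in the source language serves as the document and the lead paragraph in the target language serves as the summary. The language codes, the dictionary layout, and the function names `language_pairs` and `build_instances` are hypothetical choices for illustration, not the authors' released code.

```python
from itertools import permutations

# Assumed ISO codes for the four languages named in the abstract:
# Czech, English, French, German.
LANGUAGES = ["cs", "en", "fr", "de"]


def language_pairs(languages):
    """Return every ordered (source, target) pair of distinct languages."""
    return list(permutations(languages, 2))


def build_instances(aligned_articles, languages=LANGUAGES):
    """Pair source-language bodies with target-language lead paragraphs.

    aligned_articles: list of dicts, one per aligned title, mapping a
    language code to {"lead": str, "body": str} (hypothetical layout).
    """
    instances = []
    for article in aligned_articles:
        for src, tgt in permutations(languages, 2):
            # Only emit a pair when both language versions exist.
            if src in article and tgt in article:
                instances.append({
                    "src_lang": src,
                    "tgt_lang": tgt,
                    "document": article[src]["body"],  # long input
                    "summary": article[tgt]["lead"],   # multi-sentence target
                })
    return instances


# Toy data: one title aligned across English and German only.
toy = [{
    "en": {"lead": "English lead.", "body": "English body text."},
    "de": {"lead": "Deutsche Einleitung.", "body": "Deutscher Haupttext."},
}]
pairs = language_pairs(LANGUAGES)
instances = build_instances(toy)
```

Four languages yield twelve ordered pairs, which matches the "twelve language pairs and directions" in the abstract. In this toy example the single aligned title produces two instances, one per direction.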
