XL-Sum: Large-Scale Multilingual Abstractive Summarization for 44 Languages
Tahmid Hasan, Abhik Bhattacharjee, Md Saiful Islam, Kazi Samin, Yuan-Fang Li, Yong-Bin Kang, M. Sohel Rahman, Rifat Shahriyar

TL;DR
XL-Sum is the largest multilingual abstractive summarization dataset with 1 million article-summary pairs across 44 languages, enabling advances in low-resource language summarization.
Contribution
This work introduces XL-Sum, a large-scale, high-quality dataset for multilingual summarization covering many low-resource languages, and demonstrates its effectiveness with fine-tuned mT5 models.
Findings
ROUGE-2 scores above 11 on 10 of the benchmarked languages
ROUGE-2 scores above 15 on some of those languages with multilingual training
Training on low-resource languages yields competitive results
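The findings above are stated in ROUGE-2, which measures bigram overlap between a generated summary and its reference. As a rough illustration of what a score such as 11 or 15 means, here is a minimal sketch of ROUGE-2 on a 0 to 100 scale. It assumes lowercased whitespace tokenization and a single reference, which is a simplification: it is not the evaluation script used in the paper, and languages without whitespace word boundaries would need a proper tokenizer. The names `bigram_counts` and `rouge_2` are hypothetical.

```python
from collections import Counter


def bigram_counts(tokens):
    """Count each adjacent token pair in a token list."""
    return Counter(zip(tokens, tokens[1:]))


def rouge_2(reference, candidate):
    """Return ROUGE-2 precision, recall, and F1 on a 0-100 scale.

    Assumption: tokens are lowercased, whitespace-separated words.
    """
    ref = bigram_counts(reference.lower().split())
    cand = bigram_counts(candidate.lower().split())

    # Clipped overlap: a bigram counts at most as often as it
    # appears in both the reference and the candidate.
    overlap = sum((ref & cand).values())
    if overlap == 0:
        return {"precision": 0.0, "recall": 0.0, "f1": 0.0}

    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    f1 = 2 * precision * recall / (precision + recall)
    return {
        "precision": 100 * precision,
        "recall": 100 * recall,
        "f1": 100 * f1,
    }


if __name__ == "__main__":
    scores = rouge_2("the cat sat on the mat", "the cat lay on the mat")
    print(scores)
```

In this toy pair, 3 of the 5 bigrams on each side match, so precision, recall, and F1 all come out at about 60. Scores on real abstractive summaries are far lower because the summary rephrases the article rather than copying it.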
Abstract
Contemporary works on abstractive text summarization have focused primarily on high-resource languages like English, mostly due to the limited availability of datasets for low/mid-resource ones. In this work, we present XL-Sum, a comprehensive and diverse dataset comprising 1 million professionally annotated article-summary pairs from BBC, extracted using a set of carefully designed heuristics. The dataset covers 44 languages ranging from low to high-resource, for many of which no public dataset is currently available. XL-Sum is highly abstractive, concise, and of high quality, as indicated by human and intrinsic evaluation. We fine-tune mT5, a state-of-the-art pretrained multilingual model, with XL-Sum and experiment on multilingual and low-resource summarization tasks. XL-Sum induces competitive results compared to the ones obtained using similar monolingual datasets: we show higher…
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Byte Pair Encoding · Dense Connections · Attention Dropout · Gated Linear Unit · SentencePiece · Dropout
