CrossSum: Beyond English-Centric Cross-Lingual Summarization for 1,500+ Language Pairs
Abhik Bhattacharjee, Tahmid Hasan, Wasi Uddin Ahmad, Yuan-Fang Li, Yong-Bin Kang, Rifat Shahriyar

TL;DR
CrossSum introduces the largest non-English-centric cross-lingual summarization dataset with 1.68 million samples across 1,500+ language pairs, enabling new research in multilingual summarization.
Contribution
It provides a novel large-scale dataset, a multistage sampling algorithm, and a new evaluation metric, LaSE, for cross-lingual summarization beyond English.
Findings
The proposed model consistently outperforms baseline models on both ROUGE and LaSE.
LaSE correlates strongly with ROUGE and, unlike ROUGE, can be measured reliably even when no references exist in the target language.
CrossSum is the first extensive non-English-centric cross-lingual summarization dataset.
Abstract
We present CrossSum, a large-scale cross-lingual summarization dataset comprising 1.68 million article-summary samples in 1,500+ language pairs. We create CrossSum by aligning parallel articles written in different languages via cross-lingual retrieval from a multilingual abstractive summarization dataset and perform a controlled human evaluation to validate its quality. We propose a multistage data sampling algorithm to effectively train a cross-lingual summarization model capable of summarizing an article in any target language. We also introduce LaSE, an embedding-based metric for automatically evaluating model-generated summaries. LaSE is strongly correlated with ROUGE and, unlike ROUGE, can be reliably measured even in the absence of references in the target language. Performance on ROUGE and LaSE indicates that our proposed model consistently outperforms baseline models. To the…
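The abstract describes aligning parallel articles across languages via cross-lingual retrieval but gives no details here. The following is a minimal sketch of one common way to do such alignment, assuming each article already has a language-agnostic embedding vector. The mutual-nearest-neighbour rule, the cosine similarity, the `threshold` value, and the toy vectors and IDs are all illustrative assumptions, not the authors' procedure.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def align_articles(src_vecs, tgt_vecs, threshold=0.8):
    """Pair articles that are each other's nearest neighbour.

    src_vecs / tgt_vecs map article IDs to embedding vectors.
    The threshold is a hypothetical cut-off, not a value from the paper.
    Returns a list of (src_id, tgt_id, score) tuples.
    """
    def best(query, pool):
        # ID of the pool entry most similar to the query vector.
        return max(pool, key=lambda k: cosine(query, pool[k]))

    pairs = []
    for s_id, s_vec in src_vecs.items():
        t_id = best(s_vec, tgt_vecs)
        # Keep the pair only if the match is mutual.
        if best(tgt_vecs[t_id], src_vecs) != s_id:
            continue
        score = cosine(s_vec, tgt_vecs[t_id])
        if score >= threshold:
            pairs.append((s_id, t_id, score))
    return pairs


# Toy embeddings standing in for real multilingual sentence embeddings.
src = {"en-1": [0.9, 0.1, 0.0], "en-2": [0.0, 1.0, 0.2]}
tgt = {"bn-1": [1.0, 0.0, 0.1], "bn-2": [0.1, 0.9, 0.3], "bn-3": [0.0, 0.0, 1.0]}

pairs = align_articles(src, tgt)
for s_id, t_id, score in pairs:
    print(s_id, t_id, round(score, 3))
```

In this toy run the article `bn-3` stays unaligned because no source article picks it as its nearest neighbour. Raising `threshold` trades recall for precision in the resulting pairs.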
Code & Models
- 🤗 csebuetnlp/mT5_m2o_english_crossSum (model) · 143 dl · ♡ 5
- 🤗 csebuetnlp/mT5_m2m_crossSum (model) · 27 dl · ♡ 8
- 🤗 csebuetnlp/mT5_m2o_hindi_crossSum (model) · 4 dl
- 🤗 csebuetnlp/mT5_m2o_arabic_crossSum (model) · 2.3k dl · ♡ 3
- 🤗 csebuetnlp/mT5_m2o_russian_crossSum (model) · 11 dl · ♡ 3
- 🤗 csebuetnlp/mT5_m2o_chinese_simplified_crossSum (model) · 16 dl · ♡ 20
- 🤗 csebuetnlp/mT5_m2m_crossSum_enhanced (model) · 182 dl · ♡ 11
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Biomedical Text Mining and Ontologies
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Byte Pair Encoding · Residual Connection · Layer Normalization · SentencePiece · Dropout · Gated Linear Unit
