Dataset for Automatic Summarization of Russian News
Ilya Gusev

TL;DR
This paper introduces Gazeta, the first dataset for summarization of Russian news, benchmarks several extractive and abstractive models on it, and shows that the pretrained mBART model is useful for Russian text summarization.
Contribution
The paper presents Gazeta, the first dedicated dataset for Russian news summarization, and benchmarks multiple models including pretrained mBART, establishing a new resource for the field.
Findings
Gazeta is a valid dataset for Russian summarization tasks.
Pretrained mBART performs well on Russian summarization.
Benchmark results provide a baseline for future research.
Abstract
Automatic text summarization has been studied in a variety of domains and languages. However, this is not the case for the Russian language. To address this gap, we present Gazeta, the first dataset for summarization of Russian news. We describe the properties of this dataset and benchmark several extractive and abstractive models. We demonstrate that the dataset poses a valid task for text summarization methods for Russian. Additionally, we show that the pretrained mBART model is useful for Russian text summarization.
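To make the idea of benchmarking an extractive model concrete, here is a minimal sketch in plain Python. It is not taken from the paper. The lead-n baseline (take the first n sentences as the summary), the naive sentence splitter, and the unigram-overlap F1 score (a simplified stand-in for ROUGE-1) are all assumptions chosen for illustration, as are the function names `tokenize`, `lead_n`, and `unigram_f1`.

```python
import re
from collections import Counter


def tokenize(text):
    # Lowercase and extract word tokens; \w matches Cyrillic letters in Python 3.
    return re.findall(r"\w+", text.lower())


def lead_n(text, n=3):
    # Hypothetical extractive baseline: keep the first n sentences.
    # Sentences are split naively after '.', '!' or '?' followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return " ".join(sentences[:n])


def unigram_f1(candidate, reference):
    # Simplified ROUGE-1-style score: F1 over clipped unigram overlap.
    cand = Counter(tokenize(candidate))
    ref = Counter(tokenize(reference))
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

Under these assumptions, a benchmark run would call `lead_n` on each article and average `unigram_f1` against the reference summaries. A real evaluation would more likely use a proper sentence segmenter for Russian and an established ROUGE implementation.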
Taxonomy
Methods: mBART
