Dataset for Automatic Summarization of Russian News

Ilya Gusev

arXiv:2006.11063 · cs.CL · October 6, 2021

TL;DR

This paper introduces Gazeta, the first dataset for summarization of Russian news, benchmarks several extractive and abstractive models on it, and shows that the pretrained mBART model is useful for Russian text summarization.

Contribution

The paper presents Gazeta, the first dedicated dataset for Russian news summarization, and benchmarks multiple models including pretrained mBART, establishing a new resource for the field.

Findings

01. Gazeta is a valid dataset for Russian summarization tasks.

02. Pretrained mBART performs well on Russian summarization.

03. Benchmark results provide a baseline for future research.

Abstract

Automatic text summarization has been studied in a variety of domains and languages. However, this does not hold for the Russian language. To overcome this issue, we present Gazeta, the first dataset for summarization of Russian news. We describe the properties of this dataset and benchmark several extractive and abstractive models. We demonstrate that the dataset is a valid task for methods of text summarization for Russian. Additionally, we prove the pretrained mBART model to be useful for Russian text summarization.
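
To make the idea of benchmarking an extractive model concrete, here is a minimal sketch in plain Python. It assumes a lead-n baseline (take the first n sentences of an article as its summary) and a simplified unigram-overlap F1 score in the spirit of ROUGE-1. Both are common choices in summarization work, but they are illustrative assumptions here, not a description of the paper's actual models or metrics. The function names `split_sentences`, `lead_n`, and `unigram_f1` are hypothetical.

```python
import re
from collections import Counter


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def lead_n(text, n=3):
    """Extractive baseline (assumed for illustration): first n sentences."""
    return " ".join(split_sentences(text)[:n])


def unigram_f1(candidate, reference):
    """Unigram-overlap F1 between two strings (a simplified ROUGE-1)."""
    # \w matches Cyrillic letters too, since Python 3 patterns are Unicode-aware.
    cand = Counter(re.findall(r"\w+", candidate.lower()))
    ref = Counter(re.findall(r"\w+", reference.lower()))
    # Clipped overlap: each word counts at most as often as in both texts.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

Given an article and its reference summary, `unigram_f1(lead_n(article), reference)` scores the baseline on that pair. A real evaluation would use a proper sentence tokenizer and an established ROUGE implementation rather than this simplification.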

Taxonomy

Methods: mBART