Russian Natural Language Generation: Creation of a Language Modelling Dataset and Evaluation with Modern Neural Architectures

Zein Shaheen; Gerhard Wohlgenannt; Bassel Zaity; Dmitry Mouromtsev; Vadim Pak

arXiv:2005.02470 · cs.CL · May 7, 2020 · 1 citation

TL;DR

This paper introduces a new Russian language modeling dataset and evaluates two families of neural text generation models on it, variational autoencoders (VAEs) and generative adversarial networks (GANs), addressing the scarcity of resources for Russian NLP.

Contribution

It provides a novel reference dataset for Russian language modeling and benchmarks modern neural text generation methods (VAEs and GANs) trained on it.

Findings

01

The grammatical correctness of the generated text varies between models.

02

Perplexity is used to compare how well the models perform on the dataset.

03

The lexical diversity of the generated text varies across methods.

Abstract

Generating coherent, grammatically correct, and meaningful text is very challenging; however, it is crucial to many modern NLP systems. So far, research has mostly focused on the English language; for other languages, both standardized datasets and experiments with state-of-the-art models are rare. In this work, we i) provide a novel reference dataset for Russian language modeling, and ii) experiment with popular modern methods for text generation, namely variational autoencoders and generative adversarial networks, which we trained on the new dataset. We evaluate the generated text with metrics such as perplexity, grammatical correctness, and lexical diversity.
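Two of the metrics named in the abstract can be illustrated with a minimal sketch. This is not the authors' evaluation code: the abstract does not say how perplexity or lexical diversity were computed, so the sketch assumes the standard definition of perplexity (the exponential of the mean negative log-probability per token) and uses the type-token ratio as one simple, assumed stand-in for lexical diversity. The function names, the probabilities, and the sample sentence are all hypothetical.

```python
import math


def perplexity(token_probs):
    """Perplexity as exp of the mean negative log-probability per token."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    mean_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(mean_nll)


def type_token_ratio(tokens):
    """Distinct tokens divided by total tokens (an assumed diversity measure)."""
    if not tokens:
        raise ValueError("need at least one token")
    return len(set(tokens)) / len(tokens)


# Hypothetical probabilities that some model assigned to each token of a sentence.
# A model that is uniformly unsure among 4 choices has a perplexity of about 4.
probs = [0.25, 0.25, 0.25, 0.25]
print(perplexity(probs))

# Hypothetical whitespace-tokenized Russian sentence: 6 tokens, 4 distinct.
tokens = "в москве прошла встреча в москве".split()
print(type_token_ratio(tokens))
```

Lower perplexity means the model assigns higher probability to the text, and a higher type-token ratio means fewer repeated words. The type-token ratio depends on text length, so it is only comparable across samples of similar size.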

Code & Models

Repositories

zeinsh/lenta_short_sentences (Official)

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis