Russian Natural Language Generation: Creation of a Language Modelling Dataset and Evaluation with Modern Neural Architectures
Zein Shaheen, Gerhard Wohlgenannt, Bassel Zaity, Dmitry Mouromtsev, Vadim Pak

TL;DR
This paper introduces a new Russian language modeling dataset and evaluates modern neural architectures like VAEs and GANs on this dataset, addressing the scarcity of resources for Russian NLP.
Contribution
It provides a novel reference dataset for Russian language modeling and benchmarks modern neural text generation methods (variational autoencoders and generative adversarial networks) on it.
Findings
Grammatical correctness of the generated text varies across models.
Perplexity is used to compare how well the models fit the dataset.
Lexical diversity varies across methods.
Abstract
Generating coherent, grammatically correct, and meaningful text is very challenging; however, it is crucial to many modern NLP systems. So far, research has mostly focused on the English language; for other languages, both standardized datasets and experiments with state-of-the-art models are rare. In this work, we i) provide a novel reference dataset for Russian language modeling, and ii) experiment with popular modern methods for text generation, namely variational autoencoders and generative adversarial networks, which we trained on the new dataset. We evaluate the generated text with respect to metrics such as perplexity, grammatical correctness, and lexical diversity.
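Two of the metrics named in the abstract, perplexity and lexical diversity, can be illustrated with a minimal sketch. The abstract does not give the paper's exact definitions, so the formulas below are common textbook variants and are assumptions: perplexity as the exponential of the mean negative log-likelihood per token, and lexical diversity as the share of unique n-grams (the type-token ratio when n=1). The function names `perplexity` and `distinct_n` are hypothetical and do not come from the paper.

```python
import math


def perplexity(token_log_probs):
    """Perplexity from the natural-log probabilities a model assigned to each token.

    Assumed definition: exp of the mean negative log-likelihood per token.
    """
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


def distinct_n(tokens, n=1):
    """Share of unique n-grams among all n-grams in a token sequence.

    For n=1 this is the type-token ratio; assumed here as a simple
    proxy for lexical diversity.
    """
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    return len(set(ngrams)) / len(ngrams)


# A model that assigns probability 1/4 to every token has perplexity close to 4.
uniform_log_probs = [math.log(0.25)] * 10
ppl = perplexity(uniform_log_probs)

# Toy whitespace-tokenized Russian sentence: 5 tokens, 4 distinct word forms.
sample_tokens = "кот спит и кот ест".split()
unigram_diversity = distinct_n(sample_tokens, n=1)
bigram_diversity = distinct_n(sample_tokens, n=2)
```

Lower perplexity means the model is less surprised by held-out text, while a higher distinct-n ratio means less repetition in the generated output. Because Russian is morphologically rich, a ratio computed on surface word forms would likely differ from one computed on lemmas; which variant the paper uses is not stated here.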
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis
