RuCoLA: Russian Corpus of Linguistic Acceptability
Vladislav Mikhailov, Tatiana Shamardina, Max Ryabinin, Alena Pestova, Ivan Smurov, Ekaterina Artemova

TL;DR
RuCoLA is a new high-quality Russian linguistic acceptability dataset designed to evaluate and improve language models' grammatical and semantic understanding in Russian, including in out-of-domain text generation scenarios.
Contribution
This paper introduces RuCoLA, the first comprehensive Russian acceptability corpus with in-domain and out-of-domain data, and provides baseline experiments highlighting current language models' limitations.
Findings
Language models lag behind humans in acceptability detection.
Models struggle with morphological and semantic errors.
RuCoLA enables benchmarking of Russian language model competence.
Abstract
Linguistic acceptability (LA) attracts the attention of the research community due to its many uses, such as testing the grammatical knowledge of language models and filtering implausible texts with acceptability classifiers. However, the application scope of LA in languages other than English is limited due to the lack of high-quality resources. To this end, we introduce the Russian Corpus of Linguistic Acceptability (RuCoLA), built from the ground up under the well-established binary LA approach. RuCoLA consists of k in-domain sentences from linguistic publications and k out-of-domain sentences produced by generative models. The out-of-domain set is created to facilitate the practical use of acceptability for improving language generation. Our paper describes the data collection protocol and presents a fine-grained analysis of acceptability classification experiments with a…
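The abstract mentions filtering implausible texts with acceptability classifiers. A minimal sketch of that idea follows, assuming a binary scorer that returns the probability that a sentence is acceptable. The names `filter_acceptable`, `toy_score`, and the 0.5 threshold are hypothetical and not taken from the paper; `toy_score` is a trivial placeholder, not a real acceptability model.

```python
from typing import Callable, List


def filter_acceptable(
    sentences: List[str],
    score: Callable[[str], float],
    threshold: float = 0.5,
) -> List[str]:
    """Keep only sentences whose acceptability score reaches the threshold.

    `score` is assumed to return a value in [0, 1], where higher means
    "more likely acceptable". The threshold of 0.5 is an illustrative default.
    """
    return [s for s in sentences if score(s) >= threshold]


def toy_score(sentence: str) -> float:
    """Hypothetical placeholder scorer, NOT a real acceptability classifier.

    It only checks for terminal punctuation so that the example runs
    without any model; in practice this would be a fine-tuned classifier.
    """
    return 1.0 if sentence.strip().endswith((".", "!", "?")) else 0.0


if __name__ == "__main__":
    generated = ["Кошка спит.", "спит кошка на"]
    print(filter_acceptable(generated, toy_score))
```

In practice the placeholder scorer would be swapped for a classifier trained on acceptability labels, and the threshold would be tuned on held-out data.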
Taxonomy
Topics: Text Readability and Simplification · Natural Language Processing Techniques · Topic Modeling
