RuCoLA: Russian Corpus of Linguistic Acceptability

Vladislav Mikhailov; Tatiana Shamardina; Max Ryabinin; Alena Pestova; Ivan Smurov; Ekaterina Artemova

arXiv:2210.12814 · cs.CL · October 4, 2023 · 1 citation

TL;DR

RuCoLA is a new high-quality Russian linguistic acceptability dataset designed to evaluate and improve language models' grammatical and semantic understanding in Russian, including in out-of-domain text generation scenarios.

Contribution

This paper introduces RuCoLA, the first comprehensive Russian acceptability corpus with in-domain and out-of-domain data, and provides baseline experiments highlighting current language models' limitations.

Findings

01. Language models lag behind humans in acceptability detection.

02. Models struggle with morphological and semantic errors.

03. RuCoLA enables benchmarking of Russian language model competence.

Abstract

Linguistic acceptability (LA) attracts the attention of the research community due to its many uses, such as testing the grammatical knowledge of language models and filtering implausible texts with acceptability classifiers. However, the application scope of LA in languages other than English is limited due to the lack of high-quality resources. To this end, we introduce the Russian Corpus of Linguistic Acceptability (RuCoLA), built from the ground up under the well-established binary LA approach. RuCoLA consists of 9.8k in-domain sentences from linguistic publications and 3.6k out-of-domain sentences produced by generative models. The out-of-domain set is created to facilitate the practical use of acceptability for improving language generation. Our paper describes the data collection protocol and presents a fine-grained analysis of acceptability classification experiments with a…
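
One use the abstract names is filtering implausible texts with acceptability classifiers. The sketch below illustrates that idea, assuming a scoring function that maps a sentence to a probability of being acceptable. The names `filter_acceptable` and `toy_score` and the 0.5 threshold are hypothetical illustrations, not part of RuCoLA or its repository, and `toy_score` is a deliberately crude stand-in for a trained binary classifier.

```python
from typing import Callable, Iterable, List


def filter_acceptable(
    sentences: Iterable[str],
    score: Callable[[str], float],
    threshold: float = 0.5,
) -> List[str]:
    """Keep only sentences whose acceptability score reaches the threshold.

    `score` is assumed to return a value in [0, 1], where higher means
    more acceptable (hypothetical interface, not the authors' API).
    """
    return [s for s in sentences if score(s) >= threshold]


def toy_score(sentence: str) -> float:
    """Crude stand-in scorer: reject sentences with an immediately repeated word.

    A real setup would call a classifier fine-tuned on acceptability labels.
    """
    words = sentence.lower().split()
    has_repeat = any(a == b for a, b in zip(words, words[1:]))
    return 0.0 if has_repeat else 1.0


if __name__ == "__main__":
    candidates = ["Мама мыла раму.", "Мама мыла мыла раму."]
    # Prints only the candidates that pass the toy acceptability check.
    print(filter_acceptable(candidates, toy_score))
```

Passing the scorer in as a parameter keeps the filtering logic independent of any particular model, so the toy function could be swapped for a trained classifier without changing the filter.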

Code & Models

Repositories

russiannlp/rucola
PyTorch · Official

Datasets

RussianNLP/rucola
dataset · 592 downloads

Taxonomy

Topics: Text Readability and Simplification · Natural Language Processing Techniques · Topic Modeling