Detecting Spelling and Grammatical Anomalies in Russian Poetry Texts
Ilya Koziev

TL;DR
This paper introduces methods for detecting spelling and grammatical anomalies in Russian poetry texts to improve dataset quality for generative models, including a new dataset and comprehensive evaluation of detection approaches.
Contribution
It presents a comparison of unsupervised and supervised anomaly detection methods, introduces the RUPOR dataset for Russian poetry, and offers tools for enhancing training data quality.
Findings
Supervised methods outperform unsupervised ones in anomaly detection accuracy.
The RUPOR dataset enables effective cross-sentence grammatical error detection.
Automated detection improves the quality of training datasets for creative language models.
Abstract
The quality of natural language texts in fine-tuning datasets plays a critical role in the performance of generative models, particularly in computational creativity tasks such as poem or song lyric generation. Fluency defects in generated poems significantly reduce their value. However, training texts are often sourced from internet-based platforms without stringent quality control, posing a challenge for data engineers to manage defect levels effectively. To address this issue, we propose the use of automated linguistic anomaly detection to identify and filter out low-quality texts from training datasets for creative models. In this paper, we present a comprehensive comparison of unsupervised and supervised text anomaly detection approaches, utilizing both synthetic and human-labeled datasets. We also introduce the RUPOR dataset, a collection of Russian-language human-labeled poems…
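The filtering idea described in the abstract can be illustrated with a minimal sketch. This is not the paper's method: the out-of-vocabulary ratio used as the anomaly score, the tiny `KNOWN_WORDS` lexicon, the function names, and the 0.2 threshold are all assumptions chosen for illustration. A real pipeline would plug in a full dictionary or a trained detector as the scoring function.

```python
import re
from typing import Callable, Iterable, List, Set

# Hypothetical toy lexicon (assumption); a real system would use a full
# Russian dictionary or replace the scorer with a trained detector.
KNOWN_WORDS: Set[str] = {
    "я", "помню", "чудное", "мгновенье",
    "передо", "мной", "явилась", "ты",
}


def oov_ratio(text: str, lexicon: Set[str] = KNOWN_WORDS) -> float:
    """Return the fraction of Cyrillic word tokens missing from the lexicon.

    Serves as a crude unsupervised anomaly score: higher means more
    tokens look like misspellings. Empty text scores 0.0.
    """
    tokens = re.findall(r"[а-яё]+", text.lower())
    if not tokens:
        return 0.0
    unknown = sum(1 for token in tokens if token not in lexicon)
    return unknown / len(tokens)


def filter_texts(
    texts: Iterable[str],
    score: Callable[[str], float] = oov_ratio,
    threshold: float = 0.2,
) -> List[str]:
    """Keep only texts whose anomaly score does not exceed the threshold."""
    return [text for text in texts if score(text) <= threshold]


clean = "Я помню чудное мгновенье"
noisy = "Я помню чюдное мгновенье"  # deliberate misspelling of "чудное"
kept = filter_texts([clean, noisy])
```

Because the scorer is passed in as a parameter, the same filtering loop would work unchanged with a supervised classifier that returns a defect probability per poem.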
Taxonomy
Topics: Artificial Intelligence in Games · Topic Modeling · Authorship Attribution and Profiling
