GECTurk: Grammatical Error Correction and Detection Dataset for Turkish

Atakan Kara; Farrin Marouf Sofian; Andrew Bond; Gözde Gül Şahin

arXiv:2309.11346 · cs.CL · September 21, 2023

TL;DR

This paper introduces GECTurk, a large Turkish grammatical error correction dataset created through a novel synthetic data pipeline, along with baseline models and experiments demonstrating its effectiveness and transferability.

Contribution

It presents a new synthetic data generation pipeline for Turkish GEC, along with a high-quality dataset, baseline models, and extensive experiments on transferability.

Findings

01 Synthetic data improves GEC performance for Turkish.

02 Baseline models achieve strong results on in-domain data.

03 The dataset supports transfer learning and robustness in out-of-domain scenarios.

Abstract

Grammatical Error Detection and Correction (GEC) tools have proven useful for native speakers and second language learners. Developing such tools requires a large amount of parallel, annotated data, which is unavailable for most languages. Synthetic data generation is a common practice to overcome the scarcity of such data. However, it is not straightforward for morphologically rich languages like Turkish due to complex writing rules that require phonological, morphological, and syntactic information. In this work, we present a flexible and extensible synthetic data generation pipeline for Turkish covering more than 20 expert-curated grammar and spelling rules (a.k.a., writing rules) implemented through complex transformation functions. Using this pipeline, we derive 130,000 high-quality parallel sentences from professionally edited articles. Additionally, we create a more realistic…
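
To make the idea of a rule implemented as a transformation function concrete, here is a minimal sketch in Python. It is not the authors' pipeline: the function names `merge_conjunction_de` and `make_parallel_pair` are hypothetical, and the rule shown (the conjunction "de"/"da" is written as a separate word) is a well-known Turkish writing rule chosen for illustration only. The sketch naively treats every standalone "de"/"da" token as the conjunction, whereas a real rule would need the phonological, morphological, and syntactic information the abstract mentions.

```python
from typing import Callable, List, Optional, Tuple

Rule = Callable[[List[str]], Optional[List[str]]]


def merge_conjunction_de(tokens: List[str]) -> Optional[List[str]]:
    """Hypothetical rule: corrupt a correct sentence by attaching the first
    standalone 'de'/'da' token to the word before it.

    Returns the corrupted token list, or None if the rule does not apply.
    Assumption: any standalone 'de'/'da' is the conjunction (naive check).
    """
    for i in range(1, len(tokens)):
        if tokens[i].lower() in {"de", "da"}:
            merged = tokens[i - 1] + tokens[i]
            return tokens[: i - 1] + [merged] + tokens[i + 1 :]
    return None


def make_parallel_pair(sentence: str, rules: List[Rule]) -> Optional[Tuple[str, str]]:
    """Apply the first applicable rule and return (erroneous, correct),
    or None if no rule applies to the sentence."""
    tokens = sentence.split()
    for rule in rules:
        corrupted = rule(tokens)
        if corrupted is not None:
            return " ".join(corrupted), sentence
    return None


pair = make_parallel_pair("Ben de geldim", [merge_conjunction_de])
print(pair)
```

Starting from clean, professionally edited text and injecting the error means the correct side of each pair is known by construction, which is what makes this kind of synthetic generation attractive when annotated data is scarce.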

Code & Models

Repositories

GGLAB-KU/gecturk
PyTorch · Official

Taxonomy

Topics: Natural Language Processing Techniques · Text Readability and Simplification · Topic Modeling