GECTurk: Grammatical Error Correction and Detection Dataset for Turkish
Atakan Kara, Farrin Marouf Sofian, Andrew Bond, Gözde Gül Şahin

TL;DR
This paper introduces GECTurk, a large Turkish grammatical error correction dataset created through a novel synthetic data pipeline, along with baseline models and experiments demonstrating its effectiveness and transferability.
Contribution
The paper presents a new synthetic data generation pipeline for Turkish GEC, together with a high-quality dataset, baseline models, and extensive transferability experiments.
Findings
Synthetic data improves GEC performance for Turkish.
Baseline models achieve strong results on in-domain data.
The dataset supports transfer learning and robustness in out-of-domain scenarios.
Abstract
Grammatical Error Detection and Correction (GEC) tools have proven useful for native speakers and second language learners. Developing such tools requires a large amount of parallel, annotated data, which is unavailable for most languages. Synthetic data generation is a common practice to overcome the scarcity of such data. However, it is not straightforward for morphologically rich languages like Turkish due to complex writing rules that require phonological, morphological, and syntactic information. In this work, we present a flexible and extensible synthetic data generation pipeline for Turkish covering more than 20 expert-curated grammar and spelling rules (a.k.a., writing rules) implemented through complex transformation functions. Using this pipeline, we derive 130,000 high-quality parallel sentences from professionally edited articles. Additionally, we create a more realistic…
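To make the idea of a rule-based transformation function concrete, here is a minimal sketch of how one writing rule could be turned into a corruption step that yields parallel (erroneous, correct) sentence pairs. It is an illustration under stated assumptions, not the authors' pipeline: the function names `attach_de_da` and `make_pairs`, the regex, and the output record layout are all hypothetical. The rule itself (the conjunction "de/da" is written as a separate word) is standard Turkish orthography, but the sketch ignores the morphological and syntactic checks a real implementation would need, for example telling the conjunction apart from other standalone uses of "de".

```python
import re
from typing import Callable, List, Optional

# Hypothetical rule: a standalone "de"/"da" is treated as the conjunction,
# and the injected error glues it onto the preceding word.
_STANDALONE_DE_DA = re.compile(r"(\w+) (de|da)\b")


def attach_de_da(sentence: str) -> Optional[str]:
    """Return a corrupted copy with the first standalone de/da attached,
    or None when the rule does not apply to this sentence."""
    corrupted, count = _STANDALONE_DE_DA.subn(r"\1\2", sentence, count=1)
    return corrupted if count else None


def make_pairs(sentences: List[str],
               rules: List[Callable[[str], Optional[str]]]) -> List[dict]:
    """Apply each rule to each clean sentence and keep the pairs where
    a rule actually fired. The clean sentence is the correction target."""
    pairs = []
    for clean in sentences:
        for rule in rules:
            corrupted = rule(clean)
            if corrupted is not None:
                pairs.append({
                    "source": corrupted,  # erroneous input
                    "target": clean,      # original, correct sentence
                    "rule": rule.__name__,
                })
    return pairs


if __name__ == "__main__":
    for pair in make_pairs(["Ben de geldim.", "Evde kaldım."], [attach_de_da]):
        print(pair)
```

Keeping each rule as an independent function that returns `None` when it does not apply is one simple way to get the extensibility the abstract describes: adding a rule means adding a function to the list.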
Taxonomy
Topics: Natural Language Processing Techniques · Text Readability and Simplification · Topic Modeling
