Byte-Level Grammatical Error Correction Using Synthetic and Curated Corpora
Svanhvít Lilja Ingólfsdóttir, Pétur Orri Ragnarsson, Haukur Páll Jónsson, Haukur Barri Símonarson, Vilhjálmur Þorsteinsson, Vésteinn Snæbjarnarson

TL;DR
This paper compares byte-level and subword-level models for grammatical error correction (GEC) and finds that a byte-level model yields higher correction quality across error types, particularly when it is first trained on synthetic data and then fine-tuned on real-world error corpora.
Contribution
It introduces a byte-level encoding approach for GEC and shows its superiority over subword models in correcting diverse error types, especially in morphologically rich languages.
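As a rough illustration of what byte-level encoding means here, the sketch below maps text to its raw UTF-8 byte values. The function names `to_byte_ids` and `from_byte_ids` are hypothetical, and the paper's actual model and tokenizer are not specified in this summary, so this shows only the general idea: the vocabulary is fixed at 256 values, and any string, including a misspelled Icelandic word, can be encoded without falling outside the vocabulary.

```python
def to_byte_ids(text: str) -> list[int]:
    """Encode text as a list of UTF-8 byte values, each in the range 0-255."""
    return list(text.encode("utf-8"))


def from_byte_ids(ids: list[int]) -> str:
    """Decode a list of byte values back to text.

    Invalid byte sequences are replaced rather than raising, since a model
    could in principle emit bytes that do not form valid UTF-8.
    """
    return bytes(ids).decode("utf-8", errors="replace")


if __name__ == "__main__":
    # Non-ASCII Icelandic letters occupy two bytes each in UTF-8.
    print(to_byte_ids("æ"))  # [195, 166]

    # A misspelling is still just a byte sequence; nothing is out of vocabulary.
    correct = "Snæbjarnarson"
    misspelled = "Snæbjarnarsno"  # hypothetical typo: last two letters swapped
    print(len(correct), len(to_byte_ids(correct)))
    print(to_byte_ids(misspelled))
```

The trade-off is sequence length: byte sequences are longer than subword sequences for the same text, since every character costs at least one position and accented or special letters cost two.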
Findings
Byte-level models outperform subword models in correction quality.
Synthetic training data combined with real error fine-tuning improves performance.
The approach is effective for morphologically rich languages like Icelandic.
Abstract
Grammatical error correction (GEC) is the task of correcting typos, spelling, punctuation and grammatical issues in text. Approaching the problem as a sequence-to-sequence task, we compare the use of a common subword unit vocabulary and byte-level encoding. Initial synthetic training data is created using an error-generating pipeline, and used for finetuning two subword-level models and one byte-level model. Models are then finetuned further on hand-corrected error corpora, including texts written by children, university students, dyslexic and second-language writers, and evaluated over different error types and origins. We show that a byte-level model enables higher correction quality than a subword approach, not only for simple spelling errors, but also for more complex semantic, stylistic and grammatical issues. In particular, initial training on synthetic corpora followed by…
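The abstract does not describe how the error-generating pipeline works, so the following is only a toy sketch of the general idea of synthetic GEC data: take clean sentences, inject random character-level noise, and keep (noisy, clean) pairs as training examples. The names `corrupt`, `make_pairs`, and `ALPHABET`, the four edit operations, and the `rate` parameter are all assumptions made for illustration, not the authors' method, which may well generate richer grammatical and stylistic errors.

```python
import random

# Hypothetical character inventory used for insertions and substitutions.
ALPHABET = "aábdðeéfghiíjklmnoóprstuúvxyýþæö"


def corrupt(sentence: str, rate: float, rng: random.Random) -> str:
    """Return a noisy copy of `sentence`.

    Each alphabetic character is edited with probability `rate` using one of
    four operations: delete, insert, substitute, or swap with the next character.
    """
    chars = list(sentence)
    out = []
    i = 0
    while i < len(chars):
        c = chars[i]
        if c.isalpha() and rng.random() < rate:
            op = rng.choice(["delete", "insert", "substitute", "swap"])
            if op == "delete":
                pass  # drop the character
            elif op == "insert":
                out.append(c)
                out.append(rng.choice(ALPHABET))
            elif op == "substitute":
                out.append(rng.choice(ALPHABET))
            elif op == "swap" and i + 1 < len(chars):
                out.append(chars[i + 1])
                out.append(c)
                i += 1  # the next character has already been emitted
            else:
                out.append(c)  # swap requested at the last position: keep as-is
        else:
            out.append(c)
        i += 1
    return "".join(out)


def make_pairs(sentences, rate=0.1, seed=0):
    """Build (noisy source, clean target) pairs with a reproducible seed."""
    rng = random.Random(seed)
    return [(corrupt(s, rate, rng), s) for s in sentences]


if __name__ == "__main__":
    for noisy, clean in make_pairs(["Hesturinn hljóp yfir túnið."], rate=0.2):
        print(noisy, "->", clean)
```

In a setup like the one the abstract outlines, pairs of this kind would serve for the initial training stage, with the hand-corrected corpora reserved for the later fine-tuning stage.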
Taxonomy
Topics: Natural Language Processing Techniques · Text Readability and Simplification · Topic Modeling
