TL;DR
This paper uses a sequence-to-sequence model to generate realistic grammatical errors, producing synthetic training data that improves grammatical error detection.
Contribution
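It introduces a low-cost way to generate synthetic errors from a small human-annotated corpus, using an off-the-shelf attentive sequence-to-sequence model and a simple post-processing step. Training on the resulting data lets a plain bi-directional LSTM outperform the previous state of the art in grammatical error detection.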
Findings
Synthetic error data improves detection accuracy
Achieves over 5% $F_{0.5}$ score improvement
Generated errors are mostly human-like
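For readers unfamiliar with the metric named above, the following is a minimal sketch of the standard $F_\beta$ formula, $F_\beta = (1+\beta^2)\,PR / (\beta^2 P + R)$, with $\beta = 0.5$. The function name `f_beta` and the precision/recall values are illustrative assumptions, not figures from the paper.

```python
def f_beta(precision: float, recall: float, beta: float = 0.5) -> float:
    """Standard F-beta score; beta < 1 weights precision above recall."""
    if precision == 0.0 and recall == 0.0:
        return 0.0  # avoid division by zero when nothing is predicted correctly
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Illustrative values only (not results from the paper):
high_precision = f_beta(0.8, 0.4)  # about 0.667
high_recall = f_beta(0.4, 0.8)     # about 0.444
```

With $\beta = 0.5$, a detector that is precise but misses some errors scores higher than one that finds more errors but raises more false alarms. Precision-weighted scoring of this kind is a common choice for error detection.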
Abstract
Grammatical error correction, like other machine learning tasks, greatly benefits from large quantities of high quality training data, which is typically expensive to produce. While writing a program to automatically generate realistic grammatical errors would be difficult, one could learn the distribution of naturally occurring errors and attempt to introduce them into other datasets. Initial work on inducing errors in this way using statistical machine translation has shown promise; we investigate cheaply constructing synthetic samples, given a small corpus of human-annotated data, using an off-the-rack attentive sequence-to-sequence model and a straight-forward post-processing procedure. Our approach yields error-filled artificial data that helps a vanilla bi-directional LSTM to outperform the previous state of the art at grammatical error detection, and a previously introduced model…
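The abstract does not spell out the post-processing procedure. As a rough sketch of how a generated sentence could be turned into token-level training labels for an error detector, the example below aligns a clean sentence with its corrupted counterpart and marks changed tokens. Everything here is an assumption for illustration: the function name `label_tokens`, the `c`/`i` label scheme, whitespace tokenization, and the use of Python's `difflib` for alignment are not taken from the paper.

```python
from difflib import SequenceMatcher


def label_tokens(correct: str, corrupted: str) -> list[tuple[str, str]]:
    """Label each token of `corrupted` as 'c' (correct) or 'i' (incorrect).

    Hypothetical labelling scheme: tokens that were substituted or inserted
    are marked 'i'; when a word was dropped, the token following the gap
    is marked 'i'.
    """
    src = correct.split()
    tgt = corrupted.split()
    labels = ["c"] * len(tgt)
    matcher = SequenceMatcher(a=src, b=tgt, autojunk=False)
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        if tag in ("replace", "insert"):
            for j in range(j1, j2):
                labels[j] = "i"
        elif tag == "delete" and j1 < len(tgt):
            labels[j1] = "i"  # a word is missing just before this token
    return list(zip(tgt, labels))


substitution = label_tokens("I have a cat", "I has a cat")
omission = label_tokens("I went to the shop", "I went to shop")
```

Here `substitution` marks only `has` as incorrect, and `omission` marks `shop`, the token after the dropped word. Labelled sequences of this kind are what a sequence-labelling detector, such as the bi-directional LSTM mentioned in the abstract, would be trained on.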
