Grammatical Error Correction in Low Error Density Domains: A New Benchmark and Analyses

Simon Flachs; Ophélie Lacroix; Helen Yannakoudakis; Marek Rei; Anders Søgaard

arXiv:2010.07574·cs.CL·October 16, 2020

TL;DR

This paper introduces CWEB, a new benchmark for grammatical error correction in website text, highlighting challenges in low error density domains and analyzing the limitations of current GEC systems.

Contribution

It presents a new dataset for GEC in website text and analyzes the difficulties of applying existing models to low error density domains.

Findings

01

Current GEC systems struggle with low error density data.

02

In low error density text such as websites, GEC systems cannot rely on a strong internal language model.

03

The new benchmark enables better evaluation across domains.

Abstract

Evaluation of grammatical error correction (GEC) systems has primarily focused on essays written by non-native learners of English, which, however, is only part of the full spectrum of GEC applications. We aim to broaden the target domain of GEC and release CWEB, a new benchmark for GEC consisting of website text generated by English speakers of varying levels of proficiency. Website data is a common and important domain that contains far fewer grammatical errors than learner essays, which we show presents a challenge to state-of-the-art GEC systems. We demonstrate that a factor behind this is the inability of systems to rely on a strong internal language model in low error density domains. We hope this work will facilitate the development of open-domain GEC models that generalize to different topics and genres.
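To make "error density" concrete, here is a minimal sketch of how it might be measured on an annotated corpus. The abstract does not define the metric, so both measures below (share of sentences with at least one edit, and edits per 100 tokens) are assumptions, as are the toy data format and the `error_density` helper; none of this is taken from the CWEB release.

```python
from typing import List, Tuple

# Hypothetical toy format: each sentence is (tokens, edits), where an edit is
# a (start, end, correction) span over the source tokens.
Edit = Tuple[int, int, str]
Sentence = Tuple[List[str], List[Edit]]


def error_density(corpus: List[Sentence]) -> Tuple[float, float]:
    """Return (share of sentences with >= 1 edit, edits per 100 tokens)."""
    n_sents = len(corpus)
    n_tokens = sum(len(tokens) for tokens, _ in corpus)
    if n_sents == 0 or n_tokens == 0:
        return 0.0, 0.0
    n_edits = sum(len(edits) for _, edits in corpus)
    n_erroneous = sum(1 for _, edits in corpus if edits)
    return n_erroneous / n_sents, 100.0 * n_edits / n_tokens


# Invented example sentences, not drawn from any real dataset.
toy = [
    (["She", "go", "to", "school", "."], [(1, 2, "goes")]),
    (["The", "site", "is", "live", "."], []),
    (["Contact", "us", "for", "more", "informations", "."], [(4, 5, "information")]),
    (["Thanks", "for", "reading", "."], []),
]
sent_rate, edits_per_100 = error_density(toy)
print(f"erroneous sentences: {sent_rate:.0%}, edits per 100 tokens: {edits_per_100:.1f}")
```

Under either measure, a website corpus would be expected to score well below a learner-essay corpus, which is the setting the abstract describes as challenging for current systems.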
