Linguistic Rules-Based Corpus Generation for Native Chinese Grammatical Error Correction

Shirong Ma; Yinghui Li; Rongyi Sun; Qingyu Zhou; Shulin Huang; Ding Zhang; Li Yangning; Ruiyang Liu; Zhongli Li; Yunbo Cao; Haitao Zheng; Ying Shen

arXiv:2210.10442·cs.CL·October 20, 2022

TL;DR

This paper introduces a linguistic rules-based method to generate high-quality Chinese grammatical error correction data and creates a benchmark from native speaker errors, advancing CGEC research.

Contribution

It presents a novel rules-based approach for constructing large-scale training data and a real-world benchmark from native speaker errors, addressing key limitations in CGEC.

Findings

01. Generated training data improves CGEC model performance

02. Benchmark reflects errors made by native Chinese speakers

03. Method enhances the realism and quality of CGEC datasets

Abstract

Chinese Grammatical Error Correction (CGEC) is both a challenging NLP task and a common application in daily life. Recently, many data-driven approaches have been proposed to advance CGEC research. However, the field has two major limitations. First, the lack of high-quality annotated training corpora prevents the performance of existing CGEC models from improving significantly. Second, the grammatical errors in widely used test sets are not made by native Chinese speakers, resulting in a significant gap between CGEC models and real applications. In this paper, we propose a linguistic rules-based approach to construct large-scale CGEC training corpora with automatically generated grammatical errors. Additionally, we present a challenging CGEC benchmark derived entirely from errors made by native Chinese speakers in real-world scenarios. Extensive…
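To make the idea of rule-generated training pairs concrete, here is a minimal Python sketch. It is not the authors' method: the three rules (`drop_token`, `duplicate_token`, `swap_adjacent`), the rule names, and the pre-tokenized example sentence are all hypothetical stand-ins, chosen only to show how a correct sentence can be turned into an (erroneous source, correct target) pair by a named rule.

```python
import random
from typing import Callable, Dict, List, Tuple

Tokens = List[str]


def drop_token(tokens: Tokens, rng: random.Random) -> Tokens:
    """Hypothetical rule: delete one token to simulate a missing component."""
    if len(tokens) < 2:
        return list(tokens)
    i = rng.randrange(len(tokens))
    return tokens[:i] + tokens[i + 1:]


def duplicate_token(tokens: Tokens, rng: random.Random) -> Tokens:
    """Hypothetical rule: repeat one token to simulate a redundant component."""
    if not tokens:
        return []
    i = rng.randrange(len(tokens))
    return tokens[:i + 1] + [tokens[i]] + tokens[i + 1:]


def swap_adjacent(tokens: Tokens, rng: random.Random) -> Tokens:
    """Hypothetical rule: swap two neighbouring tokens to simulate a word-order error."""
    if len(tokens) < 2:
        return list(tokens)
    i = rng.randrange(len(tokens) - 1)
    out = list(tokens)
    out[i], out[i + 1] = out[i + 1], out[i]
    return out


# Assumed rule inventory for illustration; the paper's actual rules are not given here.
RULES: Dict[str, Callable[[Tokens, random.Random], Tokens]] = {
    "missing": drop_token,
    "redundant": duplicate_token,
    "word_order": swap_adjacent,
}


def make_pair(tokens: Tokens, rule_name: str, seed: int = 0) -> Tuple[Tokens, Tokens]:
    """Return (erroneous source, correct target) for one sentence and one rule."""
    rng = random.Random(seed)  # seeded so the generated corpus is reproducible
    corrupted = RULES[rule_name](tokens, rng)
    return corrupted, list(tokens)


if __name__ == "__main__":
    sentence = ["我", "喜欢", "读", "书"]  # assumed to be tokenized already
    for name in RULES:
        source, target = make_pair(sentence, name, seed=1)
        print(name, "".join(source), "->", "".join(target))
```

Because each pair records which rule produced it, a corpus built this way carries error-type labels for free. Random token edits like these are much cruder than linguistically motivated rules, so the sketch shows the data format rather than the quality of errors such an approach aims for.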

Taxonomy

Topics: Natural Language Processing Techniques · Text Readability and Simplification · Topic Modeling

Methods: Test