Linguistic Rules-Based Corpus Generation for Native Chinese Grammatical Error Correction
Shirong Ma, Yinghui Li, Rongyi Sun, Qingyu Zhou, Shulin Huang, Ding Zhang, Li Yangning, Ruiyang Liu, Zhongli Li, Yunbo Cao, Haitao Zheng, Ying Shen

TL;DR
This paper introduces a linguistic rules-based method to generate high-quality Chinese grammatical error correction data and creates a benchmark from native speaker errors, advancing CGEC research.
Contribution
It presents a novel rules-based approach for constructing large-scale training data and a real-world benchmark from native speaker errors, addressing key limitations in CGEC.
Findings
Generated training data improves CGEC model performance
Benchmark reflects errors made by native Chinese speakers
Method enhances the realism and quality of CGEC datasets
Abstract
Chinese Grammatical Error Correction (CGEC) is both a challenging NLP task and a common application in everyday life. Recently, many data-driven approaches have been proposed to advance CGEC research. However, the CGEC field has two major limitations. First, the lack of high-quality annotated training corpora prevents the performance of existing CGEC models from improving significantly. Second, the grammatical errors in widely used test sets are not made by native Chinese speakers, which leaves a significant gap between CGEC models and real applications. In this paper, we propose a linguistic rules-based approach to construct large-scale CGEC training corpora with automatically generated grammatical errors. Additionally, we present a challenging CGEC benchmark derived entirely from errors made by native Chinese speakers in real-world scenarios. Extensive…
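To make the idea of rule-generated errors concrete, here is a minimal Python sketch of how a grammatical sentence might be corrupted by hand-written rules to yield (erroneous, correct) training pairs. It is not the authors' implementation: the two rules, their names (`drop_particle`, `add_redundant_adverb`), and the pre-tokenized input are all assumptions chosen for illustration, and the abstract does not specify the actual rule set.

```python
from typing import Callable, List, Optional, Tuple

Tokens = List[str]
Rule = Callable[[Tokens], Optional[Tokens]]


def drop_particle(tokens: Tokens) -> Optional[Tokens]:
    """Hypothetical 'missing component' rule: delete the first particle 的."""
    if "的" not in tokens:
        return None  # rule does not apply to this sentence
    i = tokens.index("的")
    return tokens[:i] + tokens[i + 1:]


def add_redundant_adverb(tokens: Tokens) -> Optional[Tokens]:
    """Hypothetical 'redundant component' rule: insert 十分 after 非常."""
    if "非常" not in tokens:
        return None  # rule does not apply to this sentence
    i = tokens.index("非常")
    return tokens[:i + 1] + ["十分"] + tokens[i + 1:]


# Illustrative rule set; a real system would need many more rules.
RULES: List[Rule] = [drop_particle, add_redundant_adverb]


def generate_pairs(tokens: Tokens, rules: List[Rule]) -> List[Tuple[str, str]]:
    """Apply each rule independently and return (erroneous, correct) pairs."""
    correct = "".join(tokens)
    pairs = []
    for rule in rules:
        corrupted = rule(tokens)
        if corrupted is not None:
            pairs.append(("".join(corrupted), correct))
    return pairs


if __name__ == "__main__":
    for source, target in generate_pairs(["他", "非常", "高兴"], RULES):
        print(source, "->", target)
```

Each rule returns `None` when it does not apply, so only sentences that match a rule's precondition contribute a pair; the original sentence serves as the correction target.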
Taxonomy
Topics: Natural Language Processing Techniques · Text Readability and Simplification · Topic Modeling
Methods: Test
