NaSGEC: a Multi-Domain Chinese Grammatical Error Correction Dataset from Native Speaker Texts

Yue Zhang; Bo Zhang; Haochen Jiang; Zhenghua Li; Chen Li; Fei Huang; Min Zhang

arXiv:2305.16023 · cs.CL · May 26, 2023 · 2 cites

TL;DR

NaSGEC is a new multi-domain Chinese grammatical error correction dataset derived from native speaker texts across social media, scientific writing, and exams, aiming to advance cross-domain GEC research.

Contribution

The paper introduces NaSGEC, a multi-domain Chinese GEC dataset with multiple references, and provides benchmark results and domain analysis to support cross-domain GEC research.

Findings

01 NaSGEC covers 12,500 sentences from three native-speaker domains: social media, scientific writing, and examination.

02 Benchmark results are reported for NaSGEC using cutting-edge CGEC models and different training data.

03 Analysis reveals the gaps and connections between these domains in Chinese GEC.

Abstract

We introduce NaSGEC, a new dataset to facilitate research on Chinese grammatical error correction (CGEC) for native speaker texts from multiple domains. Previous CGEC research primarily focuses on correcting texts from a single domain, especially learner essays. To broaden the target domain, we annotate multiple references for 12,500 sentences from three native domains, i.e., social media, scientific writing, and examination. We provide solid benchmark results for NaSGEC by employing cutting-edge CGEC models and different training data. We further perform detailed analyses of the connections and gaps between our domains from both empirical and statistical views. We hope this work can inspire future studies on an important but under-explored direction: cross-domain GEC.
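As a rough illustration of what "multiple references" across several domains implies for evaluation, the sketch below scores a system output against every reference of a sentence and keeps the best match, then averages per domain. Everything here is an assumption made for illustration: the `GecSample` record, its field names, and the helper functions are hypothetical and do not reflect the released file format, and the character-level similarity from Python's `difflib` is only a stand-in, not the metric used in the paper.

```python
from collections import defaultdict
from dataclasses import dataclass
from difflib import SequenceMatcher

# The three domains named in the abstract.
DOMAINS = ("social media", "scientific writing", "examination")


@dataclass
class GecSample:
    """Hypothetical record: one source sentence with several references."""
    domain: str
    source: str
    references: list

    def __post_init__(self):
        if self.domain not in DOMAINS:
            raise ValueError(f"unknown domain: {self.domain}")
        if not self.references:
            raise ValueError("at least one reference is required")


def best_reference_score(hypothesis, sample):
    """Score against each reference and keep the highest value.

    SequenceMatcher.ratio() is a stand-in similarity, not a GEC metric.
    """
    return max(
        SequenceMatcher(None, hypothesis, ref).ratio()
        for ref in sample.references
    )


def mean_score_by_domain(pairs):
    """Average the best-reference score per domain.

    `pairs` is an iterable of (hypothesis, GecSample) tuples.
    """
    scores = defaultdict(list)
    for hypothesis, sample in pairs:
        scores[sample.domain].append(best_reference_score(hypothesis, sample))
    return {domain: sum(vals) / len(vals) for domain, vals in scores.items()}
```

Taking the maximum over references means a system is not penalized for choosing one acceptable correction over another; reporting scores per domain keeps cross-domain differences visible instead of folding them into a single average.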

Code & Models

Repositories

hillzhang1999/nasgec
Official

Taxonomy

Topics: Natural Language Processing Techniques · Text Readability and Simplification · Topic Modeling