EdaCSC: Two Easy Data Augmentation Methods for Chinese Spelling Correction

Lei Sheng; Shuai-Shuai Xu

arXiv:2409.05105 · cs.CL · September 10, 2024

TL;DR

This paper introduces two simple data augmentation techniques for Chinese Spelling Correction that improve robustness on sentences with multiple typos and achieve state-of-the-art results on the SIGHAN15 test set.

Contribution

The paper proposes two data augmentation methods for Chinese Spelling Correction, a data-centric alternative to existing model-centric approaches, that outperform most existing models on the SIGHAN benchmarks.

Findings

1. Achieved state-of-the-art performance on the SIGHAN15 test set.
2. Enhanced model robustness against sentences with multiple typos.
3. Outperformed most existing models on the SIGHAN benchmark datasets.

Abstract

Chinese Spelling Correction (CSC) aims to detect and correct spelling errors in Chinese sentences caused by phonetic or visual similarities. While current CSC models integrate pinyin or glyph features and have shown significant progress, they still face challenges when dealing with sentences containing multiple typos and are susceptible to overcorrection in real-world scenarios. In contrast to existing model-centric approaches, we propose two data augmentation methods to address these limitations. Firstly, we augment the dataset by either splitting long sentences into shorter ones or reducing typos in sentences with multiple typos. Subsequently, we employ different training processes to select the optimal model. Experimental evaluations on the SIGHAN benchmarks demonstrate the superiority of our approach over most existing models, achieving state-of-the-art performance on the SIGHAN15…
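The two augmentation ideas named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names `split_pair` and `reduce_typos`, the choice of punctuation delimiters, and the "correct one typo at a time" strategy are all assumptions made for illustration. The sketch also assumes that the source (erroneous) and target (corrected) sentences have equal length and are aligned character by character.

```python
from typing import List, Tuple

Pair = Tuple[str, str]  # (source with typos, corrected target)


def split_pair(src: str, tgt: str, delimiters: str = "，。！？；") -> List[Pair]:
    """Split an aligned pair into shorter pairs at punctuation marks.

    Hypothetical helper: the delimiter set is an assumption, not taken
    from the paper.
    """
    if len(src) != len(tgt):
        raise ValueError("source and target must be character-aligned")
    pairs, start = [], 0
    for i, ch in enumerate(tgt):
        if ch in delimiters:
            # Cut both sentences at the same index to keep them aligned.
            pairs.append((src[start:i + 1], tgt[start:i + 1]))
            start = i + 1
    if start < len(tgt):
        pairs.append((src[start:], tgt[start:]))
    return pairs


def reduce_typos(src: str, tgt: str) -> List[Pair]:
    """For a pair with two or more typos, emit variants with one typo fixed.

    Hypothetical helper: each variant corrects exactly one error position
    in the source, so it contains one typo fewer than the original.
    """
    if len(src) != len(tgt):
        raise ValueError("source and target must be character-aligned")
    diff = [i for i, (s, t) in enumerate(zip(src, tgt)) if s != t]
    if len(diff) < 2:
        return []  # nothing to reduce for zero or one typo
    variants = []
    for i in diff:
        chars = list(src)
        chars[i] = tgt[i]  # correct a single typo
        variants.append(("".join(chars), tgt))
    return variants


if __name__ == "__main__":
    src = "我今天很高心，想去公圆玩。"  # two typos: 心 (for 兴) and 圆 (for 园)
    tgt = "我今天很高兴，想去公园玩。"
    for pair in split_pair(src, tgt) + reduce_typos(src, tgt):
        print(pair)
```

Under these assumptions, the augmented pairs would be added to the original training set; how the paper actually selects split points or which typos to remove may differ.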

Code & Models

Repositories

cycloneboy/csc_eda (PyTorch, official)

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques