uChecker: Masked Pretrained Language Models as Unsupervised Chinese Spelling Checkers
Piji Li

TL;DR
uChecker leverages masked pretrained language models with a confusion set-guided masking strategy to perform unsupervised Chinese spelling correction, addressing data scarcity and overfitting issues in low-resource settings.
Contribution
The paper introduces uChecker, a novel unsupervised framework that uses masked language models and confusion set-guided masking for Chinese spelling correction.
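As a rough illustration of what confusion set-guided masking could look like, the sketch below corrupts a clean sentence by replacing some characters either with a similar-looking or similar-sounding character from a confusion set or with a [MASK] token, keeping the original character as the prediction target. This is a toy sketch under stated assumptions, not the paper's implementation: the function name `confusion_guided_mask`, the tiny `CONFUSION_SET`, and the `rate` and `p_confuse` values are all hypothetical and chosen only for illustration.

```python
import random

MASK = "[MASK]"

# Hypothetical toy confusion set: each character maps to a few visually or
# phonetically similar characters. Real confusion sets are far larger.
CONFUSION_SET = {
    "天": ["夫", "田"],
    "气": ["汽", "器"],
    "很": ["狠", "恨"],
}


def confusion_guided_mask(chars, confusion_set, rate=0.15, p_confuse=0.5, rng=None):
    """Corrupt a clean character sequence and return (corrupted, labels).

    labels[i] is the original character at corrupted positions, else None.
    rate and p_confuse are illustrative values, not taken from the paper.
    """
    if rng is None:
        rng = random.Random()
    corrupted = list(chars)
    labels = [None] * len(chars)
    for i, ch in enumerate(chars):
        # Leave most positions untouched.
        if rng.random() >= rate:
            continue
        labels[i] = ch
        candidates = confusion_set.get(ch)
        if candidates and rng.random() < p_confuse:
            # Simulate a realistic spelling error with a confusable character.
            corrupted[i] = rng.choice(candidates)
        else:
            # Fall back to ordinary masking.
            corrupted[i] = MASK
    return corrupted, labels


if __name__ == "__main__":
    sentence = list("今天天气很好")
    noisy, targets = confusion_guided_mask(
        sentence, CONFUSION_SET, rate=0.5, rng=random.Random(0)
    )
    print(noisy)
    print(targets)
```

Pairs of this kind need no human annotation, since the corrupted input and its targets are both derived from clean text. That is the sense in which such a masking scheme fits an unsupervised setting.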
Findings
uChecker achieves high accuracy in spelling error detection and correction.
The model outperforms baseline methods on standard datasets.
Confusion set-guided masking improves model performance.
Abstract
Chinese Spelling Check (CSC) aims to detect and correct spelling errors in text. Because manually annotating a high-quality dataset is expensive and time-consuming, training sets are usually very small (e.g., SIGHAN15 contains only 2339 training samples), so supervised models tend to suffer from data sparsity and over-fitting, especially in the era of big language models. In this paper, we investigate the unsupervised paradigm for CSC and propose a framework named uChecker to conduct unsupervised spelling error detection and correction. Masked pretrained language models such as BERT are used as the backbone because of their powerful language diagnosis capability. Benefiting from the various and flexible MASKing…
Taxonomy
Topics: Natural Language Processing Techniques · Text Readability and Simplification · Second Language Acquisition and Learning
Methods: Attention Is All You Need · Linear Layer · Adam · Softmax · Residual Connection · Linear Warmup With Linear Decay · Dropout · WordPiece · Dense Connections · Weight Decay
