Rethinking Masked Language Modeling for Chinese Spelling Correction

Hongqiu Wu; Shaohua Zhang; Yuchen Zhang; Hai Zhao

arXiv:2305.17721 · cs.CL · May 30, 2023 · 2 cites

TL;DR

This paper improves Chinese Spelling Correction by analyzing model overfitting issues, introducing a diverse benchmark, and proposing a simple masking strategy that enhances language modeling and achieves state-of-the-art results.

Contribution

It identifies overfitting in BERT-based CSC models, introduces the LEMON benchmark for better evaluation, and proposes a simple masking technique to improve model generalization.

Findings

01

Randomly masking 20% of the non-error tokens during fine-tuning improves language modeling.

02

The proposed method achieves state-of-the-art results on multiple datasets.

03

The approach enhances out-of-distribution error pattern generalization.
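
The masking step in finding 01 can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `mask_non_error_tokens`, the per-token Bernoulli sampling, and the literal `[MASK]` string are assumptions made for the example. It treats a position as "non-error" when the source and target characters agree, and it never masks a position where they differ, so the error signal stays visible to the model.

```python
import random

MASK = "[MASK]"  # assumed mask symbol; a real tokenizer supplies its own


def mask_non_error_tokens(source, target, rate=0.2, seed=None):
    """Mask roughly `rate` of the positions where source and target agree.

    `source` is the (possibly misspelled) input sequence and `target` is the
    corrected sequence of the same length. Positions where they differ are
    the spelling errors and are always left untouched.
    """
    if len(source) != len(target):
        raise ValueError("source and target must have the same length")
    rng = random.Random(seed)
    masked = list(source)
    for i, (src_tok, tgt_tok) in enumerate(zip(source, target)):
        # Hypothetical sampling scheme: an independent coin flip per token.
        if src_tok == tgt_tok and rng.random() < rate:
            masked[i] = MASK
    return masked


# Example pair: the fifth character is misspelled in the source.
source = list("我喜欢吃平果")
target = list("我喜欢吃苹果")
masked = mask_non_error_tokens(source, target, rate=0.2, seed=0)
```

The training target stays the full corrected sequence, so a masked position has to be predicted from context alone, while the unmasked error position still pairs the wrong character with its correction.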

Abstract

In this paper, we study Chinese Spelling Correction (CSC) as a joint decision made by two separate models: a language model and an error model. Through empirical analysis, we find that fine-tuning BERT tends to over-fit the error model while under-fitting the language model, resulting in poor generalization to out-of-distribution error patterns. Given that BERT is the backbone of most CSC models, this phenomenon has a significant negative impact. To address this issue, we are releasing a multi-domain benchmark, LEMON, with higher quality and diversity than existing benchmarks, to allow a comprehensive assessment of the open-domain generalization of CSC models. Then, we demonstrate that a very simple strategy, randomly masking 20% of the non-error tokens in the input sequence during fine-tuning, is sufficient for learning a much better language model without sacrificing the error model. This…
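
One way to read the "joint decision" framing is through Bayes' rule. The notation here is an assumption made for illustration: X = (x_1, ..., x_n) is the input sentence, y_i is the corrected character at position i, and x_{-i} is the input with position i removed.

```latex
P(y_i \mid X)
  = \frac{P(x_i \mid y_i, x_{-i}) \, P(y_i \mid x_{-i})}{P(x_i \mid x_{-i})}
  \;\propto\;
  \underbrace{P(y_i \mid x_{-i})}_{\text{language model}}
  \cdot
  \underbrace{P(x_i \mid y_i, x_{-i})}_{\text{error model}}
```

The denominator does not depend on y_i, so it drops out when comparing candidate corrections. Under this reading, the first factor asks which character fits the context, and the second asks how likely that character is to be mistyped as the observed x_i.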

Code & Models

Repositories

gingasan/lemon (PyTorch, official)

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Speech Recognition and Synthesis

Methods: Attention Is All You Need · Linear Layer · Multi-Head Attention · Adam · Dense Connections · WordPiece · Weight Decay · Linear Warmup With Linear Decay · Attention Dropout