CNMBERT: A Model for Converting Hanyu Pinyin Abbreviations to Chinese Characters
Zishuo Feng, Feng Cao

TL;DR
This paper introduces CNMBERT, a BERT-based model that uses a multi-mask strategy and Mixture of Experts (MoE) layers to convert Hanyu Pinyin abbreviations into Chinese characters. It outperforms fine-tuned GPT models and ChatGPT-4o on a 10,373-sample test dataset.
Contribution
The paper presents CNMBERT, a BERT-based approach that combines a multi-mask strategy with MoE layers and is designed specifically for converting Pinyin abbreviations to Chinese characters.
Findings
CNMBERT achieves a 61.53% MRR (mean reciprocal rank) score on a 10,373-sample test dataset.
CNMBERT attains 51.86% accuracy on the same dataset.
CNMBERT outperforms fine-tuned GPT models and ChatGPT-4o.
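For readers unfamiliar with the two reported metrics, the following is a minimal sketch of how MRR and top-1 accuracy are commonly computed from ranked candidate lists. The function names and the toy data are illustrative assumptions; the listing does not describe the paper's exact evaluation protocol (for example, how many candidates are ranked per sample), so this should not be read as the authors' evaluation code.

```python
from typing import List, Sequence, Tuple


def reciprocal_rank(ranked: Sequence[str], gold: str) -> float:
    """Return 1/rank of the gold answer in the ranked list, or 0.0 if absent."""
    for rank, candidate in enumerate(ranked, start=1):
        if candidate == gold:
            return 1.0 / rank
    return 0.0


def evaluate(predictions: List[Sequence[str]], golds: List[str]) -> Tuple[float, float]:
    """Compute (MRR, top-1 accuracy) over paired predictions and gold answers."""
    if len(predictions) != len(golds) or not golds:
        raise ValueError("predictions and golds must be non-empty and equal in length")
    n = len(golds)
    mrr = sum(reciprocal_rank(r, g) for r, g in zip(predictions, golds)) / n
    # Top-1 accuracy counts only samples whose first candidate is correct.
    acc = sum(1 for r, g in zip(predictions, golds) if r and r[0] == g) / n
    return mrr, acc


# Toy example (made-up data, not from the paper's test set).
toy_predictions = [["你好", "你会"], ["不知", "不是", "本来"], ["什么"]]
toy_golds = ["你好", "不是", "怎么"]
toy_mrr, toy_acc = evaluate(toy_predictions, toy_golds)
```

Because MRR gives partial credit when the correct answer appears below the first position, it is always at least as high as top-1 accuracy, which is consistent with the 61.53% and 51.86% figures reported above.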
Abstract
The task of converting Hanyu Pinyin abbreviations to Chinese characters is a significant branch within the domain of Chinese Spelling Correction (CSC). It plays an important role in many downstream applications such as named entity recognition and sentiment analysis. This task typically involves text-length alignment and seems easy to solve; however, due to the limited information content in pinyin abbreviations, achieving accurate conversion is challenging. In this paper, we treat this as a fill-mask task and propose CNMBERT, which stands for zh-CN Pinyin Multi-mask BERT Model, as a solution to this issue. By introducing a multi-mask strategy and Mixture of Experts (MoE) layers, CNMBERT outperforms fine-tuned GPT models and ChatGPT-4o with a 61.53% MRR score and 51.86% accuracy on a 10,373-sample test dataset.
Taxonomy
Topics: Natural Language Processing Techniques · Translation Studies and Practices · Lexicography and Language Studies
Methods: Attention Is All You Need · Cosine Annealing · Discriminative Fine-Tuning · Linear Warmup With Cosine Annealing · Adam · Residual Connection · Weight Decay · Byte Pair Encoding · Linear Layer · Multi-Head Attention
