CNMBERT: A Model for Converting Hanyu Pinyin Abbreviations to Chinese Characters

Zishuo Feng; Feng Cao

arXiv:2411.11770 · cs.CL · January 29, 2025

TL;DR

This paper introduces CNMBERT, a BERT-based model that uses a multi-mask strategy and Mixture of Experts (MoE) layers to convert Hanyu Pinyin abbreviations into Chinese characters. It outperforms fine-tuned GPT models and ChatGPT-4o on a 10,373-sample test dataset.

Contribution

The paper presents CNMBERT, a BERT-based model that treats Pinyin abbreviation to Chinese character conversion as a fill-mask task, using a multi-mask strategy and Mixture of Experts (MoE) layers.

Findings

01. CNMBERT achieves a 61.53% MRR score.

02. CNMBERT attains 51.86% accuracy.

03. CNMBERT outperforms fine-tuned GPT models and ChatGPT-4o.
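
The two metrics reported above can be illustrated with a minimal sketch. It assumes that MRR is the mean of 1/rank of the correct answer in each ranked candidate list (0 when the answer is missing) and that accuracy is top-1 accuracy; the paper's exact evaluation protocol is not given on this page, and the function names and toy data below are hypothetical.

```python
from typing import Sequence, Tuple


def reciprocal_rank(candidates: Sequence[str], gold: str) -> float:
    """Return 1/rank of the gold answer, or 0.0 if it is not a candidate."""
    for rank, cand in enumerate(candidates, start=1):
        if cand == gold:
            return 1.0 / rank
    return 0.0


def mrr_and_accuracy(
    ranked: Sequence[Sequence[str]], golds: Sequence[str]
) -> Tuple[float, float]:
    """Compute (MRR, top-1 accuracy) over a set of ranked predictions.

    Assumption: accuracy counts a sample as correct only when the
    first-ranked candidate equals the gold answer.
    """
    if len(ranked) != len(golds) or not golds:
        raise ValueError("ranked and golds must be non-empty and equal in length")
    rr = [reciprocal_rank(c, g) for c, g in zip(ranked, golds)]
    hits = [1.0 if c and c[0] == g else 0.0 for c, g in zip(ranked, golds)]
    return sum(rr) / len(rr), sum(hits) / len(hits)


# Toy data (hypothetical, not from the paper's test set).
ranked_predictions = [
    ["你好", "您好"],   # gold at rank 1 -> reciprocal rank 1.0
    ["什么", "神马"],   # gold at rank 2 -> reciprocal rank 0.5
    ["不知道"],         # gold missing   -> reciprocal rank 0.0
]
gold_answers = ["你好", "神马", "别逗了"]

mrr, accuracy = mrr_and_accuracy(ranked_predictions, gold_answers)
print(f"MRR={mrr:.4f}, accuracy={accuracy:.4f}")
```

Because MRR gives partial credit for a correct answer below rank 1, it is always at least as high as top-1 accuracy, which is consistent with the 61.53% and 51.86% figures.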

Abstract

The task of converting Hanyu Pinyin abbreviations to Chinese characters is a significant branch within the domain of Chinese Spelling Correction (CSC). It plays an important role in many downstream applications such as named entity recognition and sentiment analysis. This task typically involves text-length alignment and seems easy to solve; however, due to the limited information content in pinyin abbreviations, achieving accurate conversion is challenging. In this paper, we treat this as a fill-mask task and propose CNMBERT, which stands for zh-CN Pinyin Multi-mask BERT Model, as a solution to this issue. By introducing a multi-mask strategy and Mixture of Experts (MoE) layers, CNMBERT outperforms fine-tuned GPT models and ChatGPT-4o with a 61.53% MRR score and 51.86% accuracy on a 10,373-sample test dataset.

Code & Models

Repositories

igarashiakatuki/cnmbert (PyTorch, official)

Taxonomy

Topics: Natural Language Processing Techniques · Translation Studies and Practices · Lexicography and Language Studies

Methods: Attention Is All You Need · Cosine Annealing · Discriminative Fine-Tuning · Linear Warmup With Cosine Annealing · Adam · Residual Connection · Weight Decay · Byte Pair Encoding · Linear Layer · Multi-Head Attention