Exploring the Usage of Chinese Pinyin in Pretraining

Baojun Wang; Kun Xu; Lifeng Shang

arXiv:2310.04960·cs.CL·October 10, 2023

TL;DR

This paper introduces PmBERT, a pretraining method that fuses Chinese characters and pinyin to improve error tolerance in NLP tasks, especially for pronunciation-related errors.

Contribution

It proposes a novel parallel pretraining approach that combines characters and pinyin, making Chinese NLP models more robust to SSP (same or similar pronunciation) errors.

Findings

01 PmBERT outperforms SOTA models on error-correction datasets.

02 Fusing pinyin with characters improves phonetic robustness.

03 Pretraining tasks effectively integrate phonetic information.

Abstract

Unlike in alphabetic languages, the written form and the pronunciation of Chinese are distinct. Both characters and pinyin play an important role in Chinese language understanding. In Chinese NLP tasks, we almost always adopt characters or words as model input, and few works study how to use pinyin. However, pinyin is essential in many scenarios, such as error correction and fault tolerance for ASR-introduced errors. Most of these errors are caused by words with the same or similar pronunciation, and we refer to this type of error as SSP (same or similar pronunciation) errors for short. In this work, we explore various ways of using pinyin in pretraining models and propose a new pretraining method called PmBERT. Our method uses characters and pinyin in parallel for pretraining. Through delicate pretraining tasks, the character and pinyin representations are fused, which can enhance the error tolerance for SSP…
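
To make the notion of an SSP error and of parallel character and pinyin inputs concrete, here is a minimal Python sketch. It is not the PmBERT implementation: the `TOY_PINYIN` table and the helpers `to_parallel_inputs` and `ssp_positions` are hypothetical names introduced only for illustration, the pinyin is tone-less, and only the "same pronunciation" case is detected, not the "similar" one.

```python
# Toy, tone-less character-to-pinyin table. Hypothetical and deliberately tiny;
# a real system would use a full pinyin lexicon.
TOY_PINYIN = {
    "我": "wo",
    "在": "zai",
    "再": "zai",
    "家": "jia",
}

UNKNOWN = "[UNK]"


def to_parallel_inputs(text):
    """Return two aligned sequences: the characters and their pinyin."""
    chars = list(text)
    pinyin = [TOY_PINYIN.get(ch, UNKNOWN) for ch in chars]
    return chars, pinyin


def ssp_positions(reference, observed):
    """Return positions where the characters differ but the pinyin matches."""
    ref_chars, ref_pinyin = to_parallel_inputs(reference)
    obs_chars, obs_pinyin = to_parallel_inputs(observed)
    if len(ref_chars) != len(obs_chars):
        raise ValueError("this sketch only compares equal-length strings")
    positions = []
    for i in range(len(ref_chars)):
        same_sound = ref_pinyin[i] == obs_pinyin[i] and ref_pinyin[i] != UNKNOWN
        if ref_chars[i] != obs_chars[i] and same_sound:
            positions.append(i)
    return positions


# "我在家" (I am at home) misrecognized as "我再家": 在 and 再 are both "zai".
chars, pinyin = to_parallel_inputs("我再家")
errors = ssp_positions("我在家", "我再家")
```

In this example the erroneous sentence has a different character sequence from the reference but an identical pinyin sequence, which is the kind of signal a model that reads both sequences in parallel could exploit.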

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification