PERT: A New Solution to Pinyin to Character Conversion Task
Jinghui Xiao, Qun Liu, Xin Jiang, Yuanfeng Xiong, Haiteng Wu, Zhe Zhang

TL;DR
This paper introduces PERT, a transformer-based model for Pinyin to Character conversion, significantly improving performance over traditional methods and effectively handling out-of-dictionary issues in input method engines.
Contribution
The paper proposes PERT, a novel transformer-based approach for Pinyin to Character conversion, and demonstrates its effectiveness and improvements when combined with n-gram models and external lexicons.
Findings
PERT outperforms baseline models in P2C tasks.
Combining PERT with n-gram models yields further accuracy gains.
Incorporating external lexicons helps address OOD issues.
Abstract
The Pinyin to Character conversion (P2C) task is the core task of the Input Method Engine (IME) in commercial input software for Asian languages such as Chinese, Japanese, and Thai. It is usually treated as a sequence labelling task and solved with a language model, such as an n-gram model or an RNN. However, the low capacity of n-gram and RNN models limits their performance. This paper introduces a new solution named PERT, which stands for bidirectional Pinyin Encoder Representations from Transformers. It achieves a significant performance improvement over the baselines. Furthermore, we combine PERT with an n-gram model under a Markov framework and improve performance further. Lastly, an external lexicon is incorporated into PERT to resolve the OOD issue of IME.
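The abstract does not spell out how PERT and the n-gram model are combined under the Markov framework. One plausible reading, sketched below purely as an assumption, is an HMM-style decoder: per-position character scores (standing in for PERT outputs) act as emission terms, bigram scores act as transition terms, and Viterbi search picks the best character sequence. The function name `viterbi_decode`, the `lm_weight` parameter, and all probabilities are hypothetical toy values, not taken from the paper.

```python
import math


def viterbi_decode(emissions, bigram_logprob, lm_weight=1.0):
    """Pick the highest-scoring character sequence.

    emissions: list of dicts, one per pinyin syllable, mapping each candidate
        character to a log-probability (a stand-in for a PERT-style encoder).
    bigram_logprob: function (prev_char, char) -> log P(char | prev_char).
    lm_weight: hypothetical interpolation weight on the n-gram term.
    """
    if not emissions:
        return []
    # best[i][c] = (score of the best path ending in c at position i, backpointer)
    best = [{c: (lp, None) for c, lp in emissions[0].items()}]
    for i in range(1, len(emissions)):
        column = {}
        for c, emit_lp in emissions[i].items():
            prev, score = max(
                ((p, s + lm_weight * bigram_logprob(p, c))
                 for p, (s, _) in best[i - 1].items()),
                key=lambda item: item[1],
            )
            column[c] = (score + emit_lp, prev)
        best.append(column)
    # Backtrack from the best final character.
    last = max(best[-1], key=lambda c: best[-1][c][0])
    path = [last]
    for i in range(len(best) - 1, 0, -1):
        last = best[i][last][1]
        path.append(last)
    path.reverse()
    return path


# Toy input for the pinyin sequence "bei jing" (made-up numbers).
emissions = [
    {"北": math.log(0.4), "被": math.log(0.6)},
    {"京": math.log(0.5), "经": math.log(0.5)},
]
toy_bigrams = {("北", "京"): 0.9}


def bigram(prev, cur):
    # Unseen pairs get a small floor probability (an assumption, not smoothing from the paper).
    return math.log(toy_bigrams.get((prev, cur), 0.05))


result = viterbi_decode(emissions, bigram)
print("".join(result))
```

In this toy case the emission scores alone would prefer 被 at the first position, while the bigram term shifts the joint decision to 北京, which illustrates why adding an n-gram model could help a per-position predictor.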
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Web Data Mining and Analysis
