A hybrid text normalization system using multi-head self-attention for Mandarin
Junhui Zhang, Junjie Pan, Xiang Yin, Chen Li, Shichao Liu, Yang Zhang, Yuxuan Wang, Zejun Ma

TL;DR
This paper introduces a hybrid Mandarin text normalization system that combines rule-based and neural approaches using multi-head self-attention, achieving over 1.5% improvement in sentence-level accuracy.
Contribution
The paper presents a novel hybrid system leveraging multi-head self-attention to enhance Mandarin text normalization beyond traditional rule-based methods.
Findings
Over 1.5% improvement on sentence-level normalization accuracy
Effective handling of imbalanced pattern distribution
Potential for further performance enhancement
Abstract
In this paper, we propose a hybrid text normalization system using multi-head self-attention. The system combines the advantages of a rule-based model and a neural model for text preprocessing tasks. Previous studies in Mandarin text normalization usually rely on a set of hand-written rules, which are hard to improve for general cases. Our proposed system is motivated by the neural models from recent studies and performs better on our internal news corpus. This paper also describes several attempts to deal with the imbalanced pattern distribution of the dataset. Overall, the system's sentence-level performance improves by over 1.5%, and it has the potential to improve further.
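To make the hybrid idea concrete, here is a minimal sketch of one way a rule-based component and a learned component could be combined. It assumes a division of labor the abstract does not spell out: a classifier picks a pattern label for each numeric token from its context, and hand-written rules then produce the reading. Every name here (`normalize`, `toy_classifier`, the `digits` and `cardinal` labels) is hypothetical, and `toy_classifier` is a trivial stand-in for a neural model such as one built on multi-head self-attention.

```python
import re
from typing import Callable

DIGITS = "零一二三四五六七八九"


def read_digit_by_digit(token: str) -> str:
    """Rule: read each digit separately, e.g. for years."""
    return "".join(DIGITS[int(ch)] for ch in token)


def read_cardinal_under_100(token: str) -> str:
    """Rule: read a number below 100 as a cardinal (illustrative scope only)."""
    n = int(token)
    if n >= 100:
        # Larger numbers are out of scope for this sketch; fall back.
        return read_digit_by_digit(token)
    tens, ones = divmod(n, 10)
    if tens == 0:
        return DIGITS[ones]
    head = "十" if tens == 1 else DIGITS[tens] + "十"
    return head + (DIGITS[ones] if ones else "")


# Hand-written rules, keyed by the pattern label the classifier predicts.
RULES: dict[str, Callable[[str], str]] = {
    "digits": read_digit_by_digit,
    "cardinal": read_cardinal_under_100,
}


def toy_classifier(left: str, token: str, right: str) -> str:
    """Hypothetical stand-in for the neural model: choose a pattern label
    from the token's context. A real system would use a trained classifier."""
    return "digits" if right.startswith("年") else "cardinal"


def normalize(text: str,
              classify: Callable[[str, str, str], str] = toy_classifier) -> str:
    """Replace each ASCII digit run with its spoken form."""
    def replace(match: re.Match) -> str:
        left, right = text[:match.start()], text[match.end():]
        label = classify(left, match.group(), right)
        return RULES[label](match.group())

    return re.sub(r"[0-9]+", replace, text)


print(normalize("2019年有12个"))
```

In this arrangement the rules stay deterministic and easy to audit, while the ambiguous decision (which reading applies in this context) is delegated to the swappable `classify` argument.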
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Speech Recognition and Synthesis
