A hybrid text normalization system using multi-head self-attention for Mandarin

Junhui Zhang; Junjie Pan; Xiang Yin; Chen Li; Shichao Liu; Yang Zhang; Yuxuan Wang; Zejun Ma

arXiv:1911.04128·cs.CL·February 11, 2020

Open Access

TL;DR

This paper introduces a hybrid Mandarin text normalization system that combines a rule-based model with a neural model built on multi-head self-attention, improving sentence-level accuracy by over 1.5% on the authors' internal news corpus.

Contribution

The paper presents a novel hybrid system leveraging multi-head self-attention to enhance Mandarin text normalization beyond traditional rule-based methods.

Findings

1. Over 1.5% improvement in sentence-level normalization accuracy
2. Effective handling of imbalanced pattern distribution
3. Potential for further performance enhancement

Abstract

In this paper, we propose a hybrid text normalization system using multi-head self-attention. The system combines the advantages of a rule-based model and a neural model for text preprocessing tasks. Previous studies in Mandarin text normalization usually use a set of hand-written rules, which are hard to improve for general cases. Our proposed system is motivated by the neural models from recent studies and performs better on our internal news corpus. This paper also describes several attempts to deal with the imbalanced pattern distribution of the dataset. Overall, the performance of the system is improved by over 1.5% at the sentence level, and it has the potential to improve further.
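
One way to picture the combination of a rule-based model and a neural model described in the abstract is a two-step pipeline: a classifier chooses a reading pattern for each numeric token, and a hand-written rule performs the conversion. The sketch below is illustrative only and is not taken from the paper. The pattern names (`DIGIT`, `CARDINAL`), the rule functions, and the `classify_pattern` stub are all hypothetical; the stub is a one-line heuristic standing in for a trained neural classifier, such as one built on multi-head self-attention.

```python
import re

DIGITS = "零一二三四五六七八九"


def read_digit_by_digit(token):
    """Rule: read each digit separately, e.g. '2019' -> '二零一九'."""
    return "".join(DIGITS[int(ch)] for ch in token)


def read_as_cardinal(token):
    """Rule: read as a cardinal number. Covers 0..99 only, to keep the sketch short."""
    n = int(token)
    if not 0 <= n <= 99:
        raise ValueError("cardinal rule in this sketch only covers 0..99")
    if n < 10:
        return DIGITS[n]
    tens, ones = divmod(n, 10)
    head = "十" if tens == 1 else DIGITS[tens] + "十"
    return head + (DIGITS[ones] if ones else "")


# Hypothetical pattern inventory: label -> hand-written rule.
RULES = {"DIGIT": read_digit_by_digit, "CARDINAL": read_as_cardinal}


def classify_pattern(sentence, start, end):
    """Stand-in for a neural classifier (assumption, not the paper's model).

    A real system would predict the label from the sentence context;
    here a number followed by '年' (year) is read digit by digit.
    """
    return "DIGIT" if sentence[end:end + 1] == "年" else "CARDINAL"


def normalize(sentence):
    """Replace each run of ASCII digits with its spoken form."""
    out, pos = [], 0
    for match in re.finditer(r"[0-9]+", sentence):
        out.append(sentence[pos:match.start()])
        label = classify_pattern(sentence, match.start(), match.end())
        try:
            out.append(RULES[label](match.group()))
        except ValueError:
            # Fall back to the simplest rule when the chosen rule cannot apply.
            out.append(read_digit_by_digit(match.group()))
        pos = match.end()
    out.append(sentence[pos:])
    return "".join(out)


print(normalize("2019年有12个苹果"))  # 二零一九年有十二个苹果
```

In this arrangement the rules stay easy to inspect and correct, while the choice between them, which is the part that depends on context, is delegated to the classifier.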

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Speech Recognition and Synthesis