Non-Standard Vietnamese Word Detection and Normalization for Text-to-Speech
Huu-Tien Dang, Thi-Hai-Yen Vuong, Xuan-Hieu Phan

TL;DR
This paper presents a two-phase approach to Vietnamese non-standard word (NSW) detection and normalization for TTS systems: a model-based tagger (CRF, BiLSTM-CNN-CRF, or BERT-BiGRU-CRF) detects NSWs, and a rule-based normalizer then expands each NSW according to its type.
Contribution
It introduces a novel combination of model-based tagging and rule-based normalization specifically tailored for Vietnamese NSWs in TTS applications.
Findings
Both the BiLSTM-CNN-CRF and BERT-BiGRU-CRF models achieve F1 scores above 90%.
The approach reduces sentence error rates to below 8%.
BERT-BiGRU-CRF yields the highest F1 score of 95%.
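The two metrics quoted above can be illustrated with a minimal sketch. It assumes F1 is the usual harmonic mean of precision and recall over detected NSWs, and that sentence error rate is the fraction of sentences whose normalized output differs from the reference. The summary does not spell out either definition, so both are assumptions, and the function names are hypothetical.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall from raw counts (assumed definition)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def sentence_error_rate(predicted: list[str], reference: list[str]) -> float:
    """Fraction of sentences whose output is not an exact match (assumed definition)."""
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    if not reference:
        return 0.0
    errors = sum(p != r for p, r in zip(predicted, reference))
    return errors / len(reference)
```

For example, 8 true positives with 2 false positives and 2 false negatives give precision and recall of 0.8 each, hence an F1 of 0.8; one wrong sentence out of four gives a sentence error rate of 0.25.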
Abstract
Converting written texts into their spoken forms is an essential problem in any text-to-speech (TTS) system. However, building an effective text normalization solution for a real-world TTS system faces two main challenges: (1) the semantic ambiguity of non-standard words (NSWs), e.g., numbers, dates, ranges, scores, and abbreviations, and (2) transforming NSWs such as URLs, email addresses, hashtags, and contact names into pronounceable syllables. In this paper, we propose a new two-phase normalization approach to deal with these challenges. First, a model-based tagger is designed to detect NSWs. Then, depending on NSW types, a rule-based normalizer expands those NSWs into their final verbal forms. We conducted three empirical experiments for NSW detection using Conditional Random Fields (CRFs), BiLSTM-CNN-CRF, and BERT-BiGRU-CRF models on a manually annotated dataset including 5819 sentences…
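The two-phase flow described in the abstract (detect NSWs, then expand each one with a type-specific rule) can be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: a regular expression stands in for the model-based tagger, the `DIGITS` tag and all function names are hypothetical, and the single rule simply reads digits one at a time, which is only one of the readings a real system would need to choose between.

```python
import re

# Vietnamese words for the digits 0-9.
DIGIT_WORDS = ["không", "một", "hai", "ba", "bốn",
               "năm", "sáu", "bảy", "tám", "chín"]

# Assumption: a regex stands in for the paper's model-based tagger
# (CRF, BiLSTM-CNN-CRF, or BERT-BiGRU-CRF). "DIGITS" is a hypothetical tag.
PATTERNS = [("DIGITS", re.compile(r"^[0-9]+$"))]


def detect(tokens: list[str]) -> list[tuple[str, str]]:
    """Phase 1: tag each token with an NSW type, or "O" for ordinary words."""
    tagged = []
    for tok in tokens:
        tag = "O"
        for name, pattern in PATTERNS:
            if pattern.match(tok):
                tag = name
                break
        tagged.append((tok, tag))
    return tagged


def read_digit_by_digit(token: str) -> str:
    """Expand a digit string one digit at a time."""
    return " ".join(DIGIT_WORDS[int(ch)] for ch in token)


# Phase 2: one expansion rule per NSW type; other types would add entries here.
RULES = {"DIGITS": read_digit_by_digit}


def normalize(tokens: list[str]) -> str:
    """Run detection, then apply the rule matching each token's tag."""
    out = []
    for tok, tag in detect(tokens):
        rule = RULES.get(tag)
        out.append(rule(tok) if rule else tok)
    return " ".join(out)


print(normalize(["gọi", "115"]))  # prints "gọi một một năm"
```

Keeping detection and expansion separate means the tagger can be swapped for a learned model without touching the rules, which is the division of labor the abstract describes.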
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Speech Recognition and Synthesis
Methods: Conditional Random Field
