Minimally Supervised Written-to-Spoken Text Normalization
Ke Wu, Kyle Gorman, and Richard Sproat

TL;DR
This paper explores minimally supervised methods for text normalization in speech applications, comparing approaches with varying levels of language-specific knowledge and data availability, and evaluates their effectiveness on English and Russian.
Contribution
It introduces and evaluates a framework for text normalization that reduces reliance on extensive hand-crafted grammars and aligned data, using universal covering grammars and hallucinated data.
Findings
Universal covering grammars perform competitively with hand-crafted grammars.
Hallucinated data can effectively substitute for aligned corpora in training.
Approaches are validated on both English and Russian datasets.
Abstract
In speech applications such as text-to-speech (TTS) or automatic speech recognition (ASR), text normalization refers to the task of converting from a written representation into a representation of how the text is to be spoken. In all real-world speech applications, the text normalization engine is developed, in large part, by hand. For example, a hand-built grammar may be used to enumerate the possible ways of saying a given token in a given language, and a statistical model used to select the most appropriate pronunciation in context. In this study we examine the tradeoffs associated with using more or less language-specific domain knowledge in a text normalization engine. In the most data-rich scenario, we have access to a carefully constructed hand-built normalization grammar that for any given token will produce a set of all possible verbalizations for that…
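The two-stage setup the abstract describes (a grammar that enumerates possible verbalizations of a token, followed by a model that selects one in context) can be illustrated with a toy Python sketch. This is not the authors' system: the names `cardinal`, `candidates`, `verbalize`, and the `CUES` table are hypothetical, the "grammar" covers only English digit strings of up to four digits, and a hand-written cue-word lookup stands in for the statistical model.

```python
from typing import Dict

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]


def cardinal(n: int) -> str:
    """Spell out 0 <= n < 10000 as an English cardinal number."""
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + ("" if ones == 0 else " " + ONES[ones])
    if n < 1000:
        hundreds, rest = divmod(n, 100)
        return ONES[hundreds] + " hundred" + ("" if rest == 0 else " " + cardinal(rest))
    thousands, rest = divmod(n, 1000)
    return ONES[thousands] + " thousand" + ("" if rest == 0 else " " + cardinal(rest))


def candidates(token: str) -> Dict[str, str]:
    """Toy stand-in for a grammar: enumerate readings of a digit string.

    It deliberately overgenerates; choosing among readings is left to
    the selection step.
    """
    if not token.isdigit() or len(token) > 4:
        raise ValueError("this sketch only handles digit strings of length 1-4")
    readings = {
        "cardinal": cardinal(int(token)),
        "digits": " ".join(ONES[int(d)] for d in token),
    }
    if len(token) == 4 and token[0] != "0":
        # Year-style reading: split into two two-digit halves.
        high, low = int(token[:2]), int(token[2:])
        if low == 0:
            tail = "hundred"
        elif low < 10:
            tail = "oh " + ONES[low]
        else:
            tail = cardinal(low)
        readings["year"] = cardinal(high) + " " + tail
    return readings


# Assumption: a hand-written cue table replaces the statistical model.
CUES = {"in": "year", "since": "year", "pin": "digits", "code": "digits"}


def verbalize(token: str, prev_word: str) -> str:
    """Pick one reading based on the word immediately before the token."""
    options = candidates(token)
    preferred = CUES.get(prev_word.lower(), "cardinal")
    # Fall back to the cardinal reading if the preferred one is unavailable.
    return options.get(preferred, options["cardinal"])
```

For example, `verbalize("1984", "in")` selects the year-style reading, while `verbalize("1984", "code")` selects the digit-by-digit reading. A real system would replace the cue table with a model trained on data.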
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Speech Recognition and Synthesis
