Normalization of Non-Standard Words in Croatian Texts
Slobodan Beliga, Miran Pobar, Sanda Martinčić-Ipšić

TL;DR
This paper introduces a rule-based text normalization approach for Croatian that expands non-standard words into their full written form, a step that text-to-speech synthesis for the language depends on. The reported token rate is 95%.
Contribution
It proposes a taxonomy for classifying non-standard words in Croatian, together with rule-based normalization methods combined with a lookup dictionary.
Findings
A token rate of 95% is achieved for normalization of Croatian texts.
80% of the expanded words are in the correct morphological form.
The method targets text normalization, an integral part of any text-to-speech synthesis system.
Abstract
This paper presents text normalization, which is an integral part of any text-to-speech synthesis system. Text normalization is a set of methods whose task is to write non-standard words, such as numbers, dates, times, abbreviations, acronyms, and the most common symbols, in their full expanded form. A complete taxonomy for classifying non-standard words in the Croatian language is proposed, together with rule-based normalization methods combined with a lookup dictionary. The achieved token rate for normalization of Croatian texts is 95%, and 80% of the expanded words are in the correct morphological form.
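To make the idea of rules combined with a lookup dictionary concrete, here is a minimal Python sketch. It is not the authors' implementation. The class names, the dictionary entries, and the functions classify, expand, and normalize are all assumptions made for illustration. Tokens are split on whitespace, given a coarse class, and expanded either from a small dictionary or by a trivial single-digit rule. Dates, times, multi-digit numbers, and acronyms are left untouched here because they would need real rules.

```python
import re

# Hypothetical lookup dictionary; the entries are illustrative, not taken from the paper.
ABBREVIATIONS = {
    "dr.": "doktor",
    "npr.": "na primjer",
    "itd.": "i tako dalje",
    "tj.": "to jest",
}
SYMBOLS = {"%": "posto", "&": "i"}
# Nominative forms only; this sketch makes no attempt at case or gender agreement.
DIGITS = ["nula", "jedan", "dva", "tri", "četiri",
          "pet", "šest", "sedam", "osam", "devet"]


def classify(token: str) -> str:
    """Assign a coarse (assumed) class to a whitespace-delimited token."""
    if token.lower() in ABBREVIATIONS:
        return "abbreviation"
    if token in SYMBOLS:
        return "symbol"
    if re.fullmatch(r"[0-9]", token):
        return "digit"
    if re.search(r"[0-9]", token):
        # Multi-digit numbers, dates, and times are out of scope for this sketch.
        return "unhandled"
    return "standard"


def expand(token: str) -> str:
    """Expand a token according to its class; pass other tokens through unchanged."""
    kind = classify(token)
    if kind == "abbreviation":
        return ABBREVIATIONS[token.lower()]
    if kind == "symbol":
        return SYMBOLS[token]
    if kind == "digit":
        return DIGITS[int(token)]
    return token


def normalize(text: str) -> str:
    """Normalize a sentence token by token."""
    return " ".join(expand(t) for t in text.split())


print(normalize("Dr. Horvat ima 3 psa itd."))
```

Because the sketch always emits the dictionary form of a word, it cannot inflect an expansion to agree with its context. That limitation is one plausible reason to measure morphological correctness separately from the token rate, as the abstract does.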
Taxonomy
Topics: Natural Language Processing Techniques · Text Readability and Simplification · Speech and dialogue systems
