Normalization of Non-Standard Words in Croatian Texts

Slobodan Beliga; Miran Pobar; Sanda Martinčić-Ipšić

arXiv:1503.08167 · cs.CL · March 31, 2015

Open Access

TL;DR

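This paper introduces a rule-based text normalization approach for Croatian that expands non-standard words (numbers, dates, times, abbreviations, acronyms and common symbols) into their full written form, a step that text-to-speech systems depend on. It reports a 95% token normalization rate.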

Contribution

It proposes a taxonomy for classifying non-standard words in Croatian, together with normalization methods that combine rules with a lookup dictionary.

Findings

01

Token normalization rate of 95% achieved.

02

80% of expanded words are morphologically correct.

03

Method enhances text-to-speech system accuracy.

Abstract

This paper presents text normalization, which is an integral part of any text-to-speech synthesis system. Text normalization is a set of methods whose task is to write non-standard words, such as numbers, dates, times, abbreviations, acronyms and the most common symbols, in their full expanded form. A complete taxonomy for the classification of non-standard words in the Croatian language is proposed, together with rule-based normalization methods combined with a lookup dictionary. The achieved token rate for normalization of Croatian texts is 95%, with 80% of expanded words in the correct morphological form.
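
To make the idea of rules combined with a lookup dictionary concrete, here is a minimal Python sketch. It is not the authors' implementation: the category names are taken loosely from the list in the abstract, while the dictionary entries, the regular expressions and the function names (`classify`, `expand`, `normalize`) are assumptions made for illustration. Numbers are read digit by digit, which sidesteps Croatian number grammar and inflection entirely.

```python
import re

# Hypothetical lookup dictionary: the entries are illustrative and not taken from the paper.
ABBREVIATIONS = {
    "dr.": "doktor",
    "npr.": "na primjer",
    "itd.": "i tako dalje",
    "tj.": "to jest",
}
SYMBOLS = {"%": "posto"}
DIGITS = ["nula", "jedan", "dva", "tri", "četiri",
          "pet", "šest", "sedam", "osam", "devet"]


def classify(token):
    """Assign a token to a coarse non-standard-word category (assumed categories)."""
    if token.lower() in ABBREVIATIONS:
        return "abbreviation"
    if token in SYMBOLS:
        return "symbol"
    if re.fullmatch(r"[0-9]+", token):
        return "number"
    if re.fullmatch(r"[A-ZČĆĐŠŽ]{2,}", token):
        return "acronym"
    return "standard"


def expand(token):
    """Expand one token: dictionary lookup first, then a simple rule for numbers."""
    kind = classify(token)
    if kind == "abbreviation":
        return ABBREVIATIONS[token.lower()]
    if kind == "symbol":
        return SYMBOLS[token]
    if kind == "number":
        # Simplification: spell each digit separately instead of building
        # a grammatical Croatian numeral.
        return " ".join(DIGITS[int(d)] for d in token)
    # Acronyms and standard words pass through unchanged in this sketch.
    return token


def normalize(text):
    """Normalize whitespace-separated tokens and rejoin them."""
    return " ".join(expand(t) for t in text.split())


print(normalize("dr. Horvat ima 25 % npr."))
```

A full system would also need to choose the case, gender and number of each expanded word from its context, which is the part the abstract's 80% morphological-correctness figure refers to and which this sketch does not attempt.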

Taxonomy

Topics: Natural Language Processing Techniques · Text Readability and Simplification · Speech and dialogue systems