Non-Standard Words as Features for Text Categorization
Slobodan Beliga, Sanda Martinčić-Ipšić

TL;DR
This study explores Non-Standard Words (NSWs) as features for Croatian text categorization, showing that they are effective and can simplify the feature space without requiring lemmatization.
Contribution
It introduces a novel feature set based on NSW frequencies and statistics for Croatian text categorization, achieving high accuracy and reducing feature dimensionality.
Findings
NSW frequencies yield 87% accuracy in categorization
NSW features outperform other statistical measures
Using NSW features simplifies the feature space without lemmatization
Abstract
This paper presents categorization of Croatian texts using Non-Standard Words (NSWs) as features. Non-Standard Words include numbers, dates, acronyms, abbreviations, currency, etc. NSWs in the Croatian language are determined according to the Croatian NSW taxonomy. For the purpose of this research, 390 text documents were collected to form the SKIPEZ collection, which has 6 classes: official, literary, informative, popular, educational and scientific. The text categorization experiment was conducted on three different representations of the SKIPEZ collection: in the first representation, the frequencies of NSWs are used as features; in the second representation, statistical measures of NSWs (variance, coefficient of variation, standard deviation, etc.) are used as features; the third representation combines the first two feature sets. Naive Bayes, CN2, C4.5, kNN, Classification Trees and Random…
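The three representations described in the abstract could be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the regular expressions in `NSW_PATTERNS` are hypothetical stand-ins for the Croatian NSW taxonomy, the function names are invented for this example, and computing the statistics over per-category counts within one document is only one plausible reading of "statistical measures of NSWs".

```python
import re
import statistics

# Hypothetical, simplified NSW patterns (an assumption for illustration only;
# the paper uses a full Croatian NSW taxonomy that is not reproduced here).
# Order matters: more specific categories are matched and blanked out first.
NSW_PATTERNS = {
    "date": re.compile(r"\b\d{1,2}\.\d{1,2}\.\d{4}\.?"),
    "currency": re.compile(r"\b\d+(?:,\d+)?\s?(?:kn|EUR|USD)\b"),
    "acronym": re.compile(r"\b[A-ZČĆĐŠŽ]{2,}\b"),
    "number": re.compile(r"\b\d+(?:,\d+)?\b"),
}


def nsw_frequencies(text):
    """Representation 1: raw count of each (assumed) NSW category."""
    counts = {}
    remaining = text
    for name, pattern in NSW_PATTERNS.items():
        # Replace each match with a space so later patterns cannot re-count it.
        remaining, n = pattern.subn(" ", remaining)
        counts[name] = n
    return counts


def nsw_statistics(freqs):
    """Representation 2: summary statistics over the category counts."""
    values = list(freqs.values())
    mean = statistics.fmean(values)
    std_dev = statistics.pstdev(values)
    return {
        "variance": statistics.pvariance(values),
        "std_dev": std_dev,
        # Coefficient of variation is undefined for a zero mean; use 0.0 then.
        "coeff_variation": std_dev / mean if mean else 0.0,
    }


def combined_features(text):
    """Representation 3: frequencies and statistics in one feature vector."""
    freqs = nsw_frequencies(text)
    return {**freqs, **nsw_statistics(freqs)}


if __name__ == "__main__":
    sample = "Vlada RH je 12.3.2015. odobrila 250 kn i 3 projekta za HAZU."
    print(combined_features(sample))
```

Each resulting dictionary could then be turned into a fixed-order numeric vector and passed to any of the classifiers named in the abstract. Because the features are counts of surface patterns rather than word forms, no lemmatization step is needed to build them.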
