Non-Standard Words as Features for Text Categorization

Slobodan Beliga; Sanda Martinčić-Ipšić

arXiv:1408.6746·cs.CL·November 18, 2014

TL;DR

This study uses Non-Standard Words (NSWs) as features for categorizing Croatian texts and reports that they are effective and can simplify the feature space without requiring lemmatization.

Contribution

The paper introduces a feature set based on NSW frequencies and NSW statistics for Croatian text categorization, reporting 87% accuracy with NSW frequencies while reducing feature dimensionality.

Findings

01. NSW frequencies yield 87% accuracy in categorization

02. NSW features outperform other statistical measures

03. Using NSW features simplifies feature space without lemmatization

Abstract

This paper presents the categorization of Croatian texts using Non-Standard Words (NSWs) as features. Non-Standard Words include numbers, dates, acronyms, abbreviations, currency, etc. NSWs in the Croatian language are determined according to the Croatian NSW taxonomy. For this research, 390 text documents were collected to form the SKIPEZ collection, which has 6 classes: official, literary, informative, popular, educational and scientific. A text categorization experiment was conducted on three different representations of the SKIPEZ collection: in the first representation, the frequencies of NSWs are used as features; in the second, statistical measures of NSWs (variance, coefficient of variation, standard deviation, etc.) are used as features; the third combines the first two feature sets. Naive Bayes, CN2, C4.5, kNN, Classification Trees and Random…
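The three representations described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The regular expressions in `NSW_PATTERNS` are hypothetical and cover only four simplified categories, whereas the Croatian NSW taxonomy used in the paper is richer. The sketch also assumes that the statistical measures are computed over the per-category counts of a single document and that population (not sample) statistics are used; the abstract does not specify either choice.

```python
import re
import statistics

# Hypothetical, simplified NSW categories (assumption: NOT the paper's taxonomy).
# Order matters: more specific patterns run first so that, for example,
# the digits inside a date are not counted again as a plain number.
NSW_PATTERNS = {
    "date": re.compile(r"\b\d{1,2}\.\d{1,2}\.\d{4}\."),
    "currency": re.compile(r"\b\d+(?:,\d+)?\s?(?:kn|EUR|USD)\b"),
    "number": re.compile(r"\b\d+(?:,\d+)?\b"),
    "acronym": re.compile(r"\b[A-Z]{2,}\b"),
}


def nsw_frequencies(text):
    """Representation 1: count of NSWs per category in one document."""
    counts = {}
    remaining = text
    for name, pattern in NSW_PATTERNS.items():
        # subn blanks out each match and reports how many matches it replaced.
        remaining, n = pattern.subn(" ", remaining)
        counts[name] = n
    return counts


def nsw_statistics(counts):
    """Representation 2: summary statistics over the category counts.

    Assumption: population variance and standard deviation over the
    per-category counts; the paper may define these differently.
    """
    values = list(counts.values())
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return {
        "variance": statistics.pvariance(values),
        "std_dev": std,
        # The coefficient of variation is undefined for a zero mean; use 0.0.
        "coef_variation": std / mean if mean else 0.0,
    }


def combined_features(text):
    """Representation 3: frequencies and statistics in one feature dict."""
    counts = nsw_frequencies(text)
    return {**counts, **nsw_statistics(counts)}


sample = "Dana 12.3.2014. HNB je objavio: 100 EUR iznosi 760 kn, a 3 banke su potvrdile."
features = combined_features(sample)
```

Each document in a collection would be mapped to such a feature dictionary and then passed to a classifier. The resulting vector has only as many dimensions as there are NSW categories and statistics, which is the sense in which the feature space stays small and no lemmatization step is needed.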
