Oldies but Goldies: The Potential of Character N-grams for Romanian Texts
Dana Lupsa, Sanda-Maria Avram, Radu Lupsa

TL;DR
This paper demonstrates that simple character n-gram features combined with machine learning models, especially neural networks, can achieve high accuracy in Romanian authorship attribution, rivaling complex methods.
Contribution
It systematically evaluates multiple machine learning techniques using character n-grams for Romanian authorship attribution, highlighting the effectiveness of lightweight approaches.
Findings
ANN achieved perfect classification in four of fifteen runs with 5-gram features
Character n-grams provide state-of-the-art accuracy
Lightweight methods are effective for resource-constrained settings
Abstract
This study addresses the problem of authorship attribution for Romanian texts using the ROST corpus, a standard benchmark in the field. We systematically evaluate six machine learning techniques: Support Vector Machine (SVM), Logistic Regression (LR), k-Nearest Neighbors (k-NN), Decision Trees (DT), Random Forests (RF), and Artificial Neural Networks (ANN), employing character n-gram features for classification. Among these, the ANN model achieved the highest performance, including perfect classification in four out of fifteen runs when using 5-gram features. These results demonstrate that lightweight, interpretable character n-gram approaches can deliver state-of-the-art accuracy for Romanian authorship attribution, rivaling more complex methods. Our findings highlight the potential of simple stylometric features in resource-constrained or under-studied language settings.
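To make the feature type concrete, here is a minimal sketch of character n-gram extraction using only the Python standard library. It is an illustration, not the authors' pipeline: the function names `char_ngrams` and `relative_frequencies`, the sample sentence, and the choice to normalize counts into relative frequencies are all assumptions made for this example. The abstract does not specify how the features were weighted or preprocessed.

```python
from collections import Counter


def char_ngrams(text: str, n: int = 5) -> Counter:
    """Count overlapping character n-grams in `text` (hypothetical helper)."""
    if n < 1:
        raise ValueError("n must be >= 1")
    # Slide a window of width n over the text, one character at a time.
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def relative_frequencies(counts: Counter) -> dict:
    """Turn raw counts into relative frequencies (an assumed normalization)."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {gram: count / total for gram, count in counts.items()}


# Illustrative Romanian sentence; not taken from the ROST corpus.
sample = "a fost odată ca niciodată"
grams = char_ngrams(sample, n=5)
features = relative_frequencies(grams)
```

A vector of such frequencies, computed per text over a shared n-gram vocabulary, is the kind of input that any of the six classifiers named in the abstract could consume. Spaces and diacritics are kept as ordinary characters here, which is one plausible choice among several.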
Taxonomy
Topics: Lexicography and Language Studies · Natural Language Processing Techniques · Linguistics and Terminology Studies
