Oldies but Goldies: The Potential of Character N-grams for Romanian Texts

Dana Lupsa; Sanda-Maria Avram; Radu Lupsa

arXiv:2506.15650 · cs.CL · June 30, 2025

TL;DR

This paper demonstrates that simple character n-gram features combined with machine learning models, especially neural networks, can achieve high accuracy in Romanian authorship attribution, rivaling complex methods.

Contribution

It systematically evaluates multiple machine learning techniques using character n-grams for Romanian authorship attribution, highlighting the effectiveness of lightweight approaches.

Findings

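01. The ANN achieved perfect classification in four of fifteen runs when using 5-gram features

02. Character n-gram features deliver state-of-the-art accuracy for Romanian authorship attribution on the ROST corpus

03. Lightweight stylometric features show potential for resource-constrained or under-studied language settings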

Abstract

This study addresses the problem of authorship attribution for Romanian texts using the ROST corpus, a standard benchmark in the field. We systematically evaluate six machine learning techniques: Support Vector Machine (SVM), Logistic Regression (LR), k-Nearest Neighbors (k-NN), Decision Trees (DT), Random Forests (RF), and Artificial Neural Networks (ANN), employing character n-gram features for classification. Among these, the ANN model achieved the highest performance, including perfect classification in four out of fifteen runs when using 5-gram features. These results demonstrate that lightweight, interpretable character n-gram approaches can deliver state-of-the-art accuracy for Romanian authorship attribution, rivaling more complex methods. Our findings highlight the potential of simple stylometric features in resource-constrained or under-studied language settings.
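
To make the notion of character n-gram features concrete, the following is a minimal sketch of how overlapping character n-grams might be counted for a text. It is not the authors' pipeline: the function names `char_ngrams` and `relative_frequencies`, the sample sentence, and the choice to keep whitespace and case unchanged are assumptions made for illustration, since the abstract does not describe the preprocessing.

```python
from collections import Counter


def char_ngrams(text: str, n: int = 5) -> Counter:
    """Count overlapping character n-grams in `text`.

    Hypothetical helper: whitespace and case are left untouched,
    which is an assumption, not something the abstract specifies.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    # A text shorter than n yields no n-grams (empty Counter).
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def relative_frequencies(counts: Counter) -> dict:
    """Normalize raw counts so texts of different lengths are comparable."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {gram: count / total for gram, count in counts.items()}


# Illustrative input only; not taken from the ROST corpus.
sample = "ana are mere"
counts = char_ngrams(sample, n=3)
freqs = relative_frequencies(counts)
```

Frequency vectors of this kind, built over a shared n-gram vocabulary, are the sort of input that classifiers such as those listed in the abstract could consume; how the paper selects or weights n-grams is not stated here.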

Taxonomy

Topics: Lexicography and Language Studies · Natural Language Processing Techniques · Linguistics and Terminology Studies