BERT-based Authorship Attribution on the Romanian Dataset called ROST
Sanda-Maria Avram

TL;DR
This paper applies a BERT-based model to authorship attribution on ROST, a diverse and highly unbalanced Romanian dataset, with macro-accuracy sometimes exceeding 87%.
Contribution
It applies a BERT-based approach to Romanian authorship attribution on a challenging, unbalanced dataset and reports results better than the author expected.
Findings
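Macro-accuracy sometimes exceeded 87%.
A pre-trained BERT model was effective on a highly unbalanced Romanian dataset.
The results were obtained on texts that differ in type (stories, short stories, fairy tales, novels, literary articles, sketches), source, period, and intended medium.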
Abstract
Authorship attribution has been studied for decades and remains an active problem. Among the more recent tools are pre-trained language models, the most prevalent being BERT. Here we use such a model to detect the authorship of texts written in Romanian. The dataset is highly unbalanced: it shows significant differences in the number of texts per author, the sources from which the texts were collected, the time period in which the authors lived and wrote, the medium for which the texts were intended (paper or online), and the type of writing (stories, short stories, fairy tales, novels, literary articles, and sketches). The results are better than expected, sometimes exceeding 87% macro-accuracy.
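The abstract reports macro-accuracy on a dataset with very different numbers of texts per author. The sketch below shows why that choice of metric matters. It assumes macro-accuracy means the unweighted mean of per-author accuracy (per-class recall); the paper's exact definition is not given here, and the function names and toy labels are invented for illustration.

```python
from collections import defaultdict


def macro_accuracy(true_authors, predicted_authors):
    """Unweighted mean of per-author accuracy (assumed definition)."""
    if len(true_authors) != len(predicted_authors):
        raise ValueError("label lists must have the same length")
    if not true_authors:
        raise ValueError("no labels given")
    total = defaultdict(int)    # texts per true author
    correct = defaultdict(int)  # correctly attributed texts per true author
    for truth, pred in zip(true_authors, predicted_authors):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1
    per_author = [correct[author] / total[author] for author in total]
    return sum(per_author) / len(per_author)


def plain_accuracy(true_authors, predicted_authors):
    """Fraction of all texts attributed correctly, regardless of author."""
    hits = sum(t == p for t, p in zip(true_authors, predicted_authors))
    return hits / len(true_authors)


# Toy, made-up labels: author "A" has 8 texts, author "B" has only 2.
y_true = ["A"] * 8 + ["B"] * 2
y_pred = ["A"] * 8 + ["A", "B"]  # one of B's two texts is misattributed

print(plain_accuracy(y_true, y_pred))  # 0.9
print(macro_accuracy(y_true, y_pred))  # 0.75
```

In this toy case plain accuracy is dominated by the prolific author, while the macro average gives each author equal weight, so a weak result on a sparsely represented author is not hidden.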
Taxonomy
Topics: Authorship Attribution and Profiling · Topic Modeling
Methods: Attention Is All You Need · Linear Layer · Adam · Layer Normalization · Weight Decay · Multi-Head Attention · Residual Connection · Dense Connections · Dropout
