TL;DR
This study evaluates grammatical features (POS-tag n-grams) and lemmatization for authorship attribution in Polish, an inflected language, finding that they perform slightly worse than lexical markers but still carry useful authorial signal.
Contribution
The paper provides a comparative analysis of lexical, POS-tag, and lemmatized features for authorship attribution in Polish, showing that grammatical features trail lexical markers only slightly and still contribute useful information.
Findings
POS-tags and lemmatized forms perform slightly worse than lexical markers.
The accuracy difference between feature types does not exceed 15%.
Grammatical features can still contribute useful information for authorship attribution.
Abstract
In stylometric investigations, frequencies of the most frequent words (MFWs) and character n-grams outperform other style-markers, even if their performance varies significantly across languages. In inflected languages, word endings play a prominent role, and hence different word forms cannot be recognized using generic text tokenization. Countless inflected word forms make frequencies sparse, making most statistical procedures complicated. Presumably, applying one of the NLP techniques, such as lemmatization and/or parsing, might increase the performance of classification. The aim of this paper is to examine the usefulness of grammatical features (as assessed via POS-tag n-grams) and lemmatized forms in recognizing authorial profiles, in order to address the underlying issue of the degree of freedom of choice within lexis and grammar. Using a corpus of Polish novels, we performed a…
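To make the three feature types named in the abstract concrete, here is a minimal Python sketch, not the authors' pipeline. It assumes the text has already been tokenized, lemmatized, and POS-tagged. The sample sentence, the tag labels, and the helper names `relative_frequencies` and `ngrams` are hypothetical and chosen only for illustration.

```python
from collections import Counter


def relative_frequencies(items, top_n):
    """Return the top_n most frequent items mapped to their relative frequencies."""
    counts = Counter(items)
    total = sum(counts.values())
    return {item: count / total for item, count in counts.most_common(top_n)}


def ngrams(sequence, n):
    """Return all consecutive n-grams of a sequence as tuples."""
    return [tuple(sequence[i:i + n]) for i in range(len(sequence) - n + 1)]


# Hypothetical pre-tagged sample: (word form, lemma, POS tag).
# The tags are illustrative, not the output of any particular Polish tagger.
tagged = [
    ("kot", "kot", "NOUN"),
    ("widzi", "widzieć", "VERB"),
    ("kota", "kot", "NOUN"),
    ("i", "i", "CONJ"),
    ("kotu", "kot", "NOUN"),
    ("ufa", "ufać", "VERB"),
]

words = [word for word, _, _ in tagged]
lemmas = [lemma for _, lemma, _ in tagged]
tags = [tag for _, _, tag in tagged]

# Three feature sets: most frequent word forms, most frequent lemmas,
# and most frequent POS-tag bigrams.
word_features = relative_frequencies(words, top_n=3)
lemma_features = relative_frequencies(lemmas, top_n=3)
pos_bigram_features = relative_frequencies(ngrams(tags, 2), top_n=3)
```

In this toy sample the three inflected forms of "kot" each occur once as word forms, while lemmatization merges them into a single, more frequent feature. This is the sparsity effect the abstract describes. In a real experiment these frequency vectors would be computed per text and passed to a classifier.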
