Stylistic Fingerprints, POS-tags and Inflected Languages: A Case Study in Polish

Maciej Eder; Rafał L. Górski

arXiv:2206.02208·cs.CL·November 3, 2022

TL;DR

This study evaluates the effectiveness of grammatical features and lemmatization in author attribution for Polish, an inflected language, finding that these methods slightly improve classification accuracy over lexical markers alone.

Contribution

The paper provides a comparative analysis of lexical, POS-tag, and lemmatized features for authorship attribution in Polish, highlighting the modest gains from grammatical features.

Findings

01. POS-tags and lemmatized forms perform slightly worse than lexical markers.

02. The accuracy difference between feature types does not exceed 15%.

03. Grammatical features can still contribute useful information for authorship attribution.

Abstract

In stylometric investigations, frequencies of the most frequent words (MFWs) and character n-grams outperform other style-markers, even if their performance varies significantly across languages. In inflected languages, word endings play a prominent role, and hence different word forms cannot be recognized using generic text tokenization. Countless inflected word forms make frequencies sparse, making most statistical procedures complicated. Presumably, applying one of the NLP techniques, such as lemmatization and/or parsing, might increase the performance of classification. The aim of this paper is to examine the usefulness of grammatical features (as assessed via POS-tag n-grams) and lemmatized forms in recognizing authorial profiles, in order to address the underlying issue of the degree of freedom of choice within lexis and grammar. Using a corpus of Polish novels, we performed a…
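To make the three feature types named in the abstract concrete, the following is a minimal sketch, not the authors' pipeline. It builds relative-frequency profiles from word forms, lemmas, and POS-tag bigrams. The toy sentence, its hand-assigned tags, and the helper names `ngrams` and `relative_frequencies` are all assumptions made for illustration; a real experiment would take lemmas and tags from a Polish tagger and use far larger texts.

```python
from collections import Counter

# Toy, hand-tagged Polish sample: (word form, lemma, POS tag).
# The tags are illustrative only and were not produced by any tagger.
TAGGED = [
    ("kot", "kot", "subst"),
    ("widzi", "widzieć", "fin"),
    ("kota", "kot", "subst"),
    ("i", "i", "conj"),
    ("kotu", "kot", "subst"),
    ("daje", "dawać", "fin"),
    ("mleko", "mleko", "subst"),
]


def ngrams(items, n):
    """Return overlapping n-grams of `items`, joined with underscores."""
    return ["_".join(items[i:i + n]) for i in range(len(items) - n + 1)]


def relative_frequencies(features, top_k):
    """Map each of the top_k most frequent features to its relative frequency."""
    counts = Counter(features)
    total = sum(counts.values())
    return {feat: count / total for feat, count in counts.most_common(top_k)}


words = [word for word, _, _ in TAGGED]
lemmas = [lemma for _, lemma, _ in TAGGED]
pos_bigrams = ngrams([tag for _, _, tag in TAGGED], 2)

word_profile = relative_frequencies(words, top_k=3)
lemma_profile = relative_frequencies(lemmas, top_k=3)
pos_profile = relative_frequencies(pos_bigrams, top_k=3)
```

In this toy sample the three inflected forms kot, kota, and kotu each occur once as word forms but collapse into a single lemma that accounts for three of the seven tokens, which is the sparsity effect the abstract describes. Profiles of this kind would then be passed to a classifier; that step is omitted here.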

Code & Models

Repositories

computationalstylistics/pl_lemmatization_in_attribution (official)
