Approaching the linguistic complexity
Stanislaw Drozdz, Jaroslaw Kwapien, Adam Orczyk

TL;DR
This paper investigates the rank-frequency distributions of words in English and Polish texts, revealing how authorship, translation, and grammatical categories influence linguistic scaling properties.
Contribution
It provides a comparative analysis of linguistic scaling in different corpora and highlights the unique scaling behavior of verbs across languages and contexts.
Findings
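The basic scaling regime is broken more strongly in a multi-author corpus than in a comparable single-author corpus, and more strongly in a corpus of texts translated into Polish than in a comparable corpus of native Polish texts.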
In the British National Corpus, rank-frequency distributions of lemmas do not scale when each part of speech is analyzed separately, except for verbs.
Verbs show a distinct trace of scaling independently of other parts of speech.
Abstract
We analyze the rank-frequency distributions of words in selected English and Polish texts. We compare scaling properties of these distributions in both languages. We also study a few small corpora of Polish literary texts and find that for a corpus consisting of texts written by different authors the basic scaling regime is broken more strongly than in the case of a comparable corpus consisting of texts written by the same author. Similarly, for a corpus consisting of texts translated into Polish from other languages the scaling regime is broken more strongly than for a comparable corpus of native Polish texts. Moreover, based on the British National Corpus, we consider the rank-frequency distributions of the grammatically basic forms of words (lemmas) tagged with their proper part of speech. We find that these distributions do not scale if each part of speech is analyzed separately. The…
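As a rough illustration of what a rank-frequency analysis involves, the sketch below counts word occurrences in a text, sorts the counts in descending order so that position corresponds to rank, and estimates a Zipf-like exponent from a least-squares fit of log frequency against log rank. The function names `rank_frequency` and `zipf_exponent`, the simple regex tokenizer, and the choice of an ordinary least-squares fit are all assumptions made for this example; the abstract does not describe the authors' tokenization or fitting procedure.

```python
import math
import re
from collections import Counter


def rank_frequency(text):
    """Return word frequencies sorted in descending order (index 0 is rank 1).

    Tokenization is an assumption of this sketch: lowercase the text and
    keep runs of Unicode letters, so Polish diacritics are preserved.
    """
    words = re.findall(r"[^\W\d_]+", text.lower())
    return sorted(Counter(words).values(), reverse=True)


def zipf_exponent(freqs, r_min=1, r_max=None):
    """Estimate alpha in f(r) ~ r**(-alpha) over ranks r_min..r_max.

    Uses an ordinary least-squares line through (log r, log f); this is a
    simple illustrative estimator, not necessarily the one used in the paper.
    """
    if r_max is None:
        r_max = len(freqs)
    ranks = range(r_min, r_max + 1)
    xs = [math.log(r) for r in ranks]
    ys = [math.log(freqs[r - 1]) for r in ranks]
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two ranks to fit a slope")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # The fitted slope is negative for a decaying distribution; alpha is its magnitude.
    return -sxy / sxx
```

Restricting the fit to a rank window through `r_min` and `r_max` is one simple way to compare exponents in different rank ranges, which is where a broken scaling regime would show up as differing slopes.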
Taxonomy
Topics: Authorship Attribution and Profiling · Opinion Dynamics and Social Influence · Language and Cultural Evolution
