Linguistic complexity: English vs. Polish, text vs. corpus
Jaroslaw Kwapien, Stanislaw Drozdz, Adam Orczyk

TL;DR
This study compares the rank-frequency distributions of words in English and Polish texts and shows that how far scale invariance extends depends on lemmatization, authorship (one author vs. several), translation status, and part of speech.
Contribution
It provides a comparative analysis of linguistic complexity in English and Polish through rank-frequency distributions, considering lemmatization, authorship, translation, and part of speech.
Findings
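Scale invariance breaks after about two decades of rank for lemmatized words, while it may hold over the whole range of ranks for inflected word forms.
Scale invariance breaks more strongly in multi-author corpora than in comparable single-author corpora, and in corpora translated into Polish than in comparable native Polish corpora.
When words are tagged with their part of speech, only verbs show a nearly scale-invariant rank-frequency distribution.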
Abstract
We analyze the rank-frequency distributions of words in selected English and Polish texts. We show that for the lemmatized (basic) word forms the scale-invariant regime breaks after about two decades, while it might be consistent for the whole range of ranks for the inflected word forms. We also find that for a corpus consisting of texts written by different authors the basic scale-invariant regime is broken more strongly than in the case of a comparable corpus consisting of texts written by the same author. Similarly, for a corpus consisting of texts translated into Polish from other languages the scale-invariant regime is broken more strongly than for a comparable corpus of native Polish texts. Moreover, we find that if the words are tagged with their proper part of speech, only verbs show a rank-frequency distribution that is almost scale-invariant.
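To make "scale-invariant regime" concrete, here is a minimal Python sketch of how a rank-frequency distribution can be computed and how its log-log slope can be compared across rank ranges. It is an illustration, not the authors' procedure: the function names `rank_frequency` and `loglog_slope`, the regex tokenizer, and the ordinary least-squares fit are all assumptions made here. A power law (Zipf-like behavior) appears as a straight line in log-log coordinates, so a slope that changes noticeably between the first two decades of rank and the later ranks would indicate a break of the kind the abstract describes.

```python
import math
import re
from collections import Counter


def rank_frequency(text):
    """Return word counts sorted in descending order (index 0 is rank 1).

    Tokenization is a simplifying assumption: lowercase the text and keep
    runs of Unicode letters, so Polish diacritics are handled as letters.
    No lemmatization is performed here.
    """
    words = re.findall(r"[^\W\d_]+", text.lower())
    return sorted(Counter(words).values(), reverse=True)


def loglog_slope(freqs, r_min, r_max):
    """Least-squares slope of log10(frequency) against log10(rank).

    Uses ranks r_min..r_max inclusive (1-based). Both bounds must lie
    within the available ranks and span at least two points.
    """
    if not (1 <= r_min < r_max <= len(freqs)):
        raise ValueError("rank range must satisfy 1 <= r_min < r_max <= len(freqs)")
    ranks = range(r_min, r_max + 1)
    xs = [math.log10(r) for r in ranks]
    ys = [math.log10(freqs[r - 1]) for r in ranks]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


if __name__ == "__main__":
    # Synthetic frequencies (not data from the paper): an ideal Zipf law
    # up to rank 100, followed by a steeper decay for higher ranks.
    freqs = [1000.0 / r if r <= 100 else 10.0 * (100.0 / r) ** 2
             for r in range(1, 1001)]
    print("slope, ranks 1-100:   ", loglog_slope(freqs, 1, 100))
    print("slope, ranks 101-1000:", loglog_slope(freqs, 101, 1000))
```

On real text, one would pass the output of `rank_frequency` for the inflected forms and, separately, for lemmatized forms produced by an external lemmatizer (not shown), and compare the slopes over the same rank ranges. A least-squares fit in log-log space is a rough diagnostic only and is not a substitute for a careful statistical test of power-law behavior.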
