Linguistic complexity: English vs. Polish, text vs. corpus

Jaroslaw Kwapien; Stanislaw Drozdz; Adam Orczyk

arXiv:1007.0936·cs.CL·July 7, 2010

TL;DR

This study compares the rank-frequency distributions of words in selected English and Polish texts and shows that how closely they follow a scale-invariant (power-law) form depends on lemmatization, authorship, translation, and part of speech.

Contribution

A comparative analysis of word rank-frequency distributions in English and Polish texts, contrasting lemmatized with inflected word forms, single-author with multi-author corpora, native Polish with translated texts, and different parts of speech.

Findings

01

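For lemmatized (basic) word forms, the scale-invariant regime breaks after about two decades of ranks, while for inflected forms it may hold over the whole range of ranks.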

02

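Scale-invariance breaks more strongly in a multi-author corpus than in a comparable single-author corpus, and more strongly in a corpus of texts translated into Polish than in a comparable corpus of native Polish texts.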

03

When words are tagged with their part of speech, only verbs show a rank-frequency distribution that is almost scale-invariant.

Abstract

We analyze the rank-frequency distributions of words in selected English and Polish texts. We show that for the lemmatized (basic) word forms the scale-invariant regime breaks after about two decades, while it might be consistent for the whole range of ranks for the inflected word forms. We also find that for a corpus consisting of texts written by different authors the basic scale-invariant regime is broken more strongly than in the case of a comparable corpus consisting of texts written by the same author. Similarly, for a corpus consisting of texts translated into Polish from other languages the scale-invariant regime is broken more strongly than for a comparable corpus of native Polish texts. Moreover, we find that if the words are tagged with their proper part of speech, only verbs show a rank-frequency distribution that is almost scale-invariant.
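To make the terms in the abstract concrete, here is a minimal Python sketch of how a rank-frequency distribution can be built and how its log-log slope can be compared across rank ranges. It is an illustration only, not the authors' procedure: the function names `rank_frequency` and `loglog_slope`, the plain least-squares fit, and the synthetic token list are all assumptions introduced here. A scale-invariant (power-law) distribution appears as a straight line on a log-log plot, so a slope that stays roughly constant across rank ranges suggests scale-invariance, and a slope that changes between ranges suggests a break. "Two decades" means two factors of ten in rank, for example ranks 1 to 100.

```python
import math
from collections import Counter


def rank_frequency(tokens):
    """Return word counts sorted in descending order, so index 0 is rank 1."""
    counts = Counter(tokens)
    return sorted(counts.values(), reverse=True)


def loglog_slope(freqs, r_min=1, r_max=None):
    """Least-squares slope of log10(frequency) against log10(rank).

    Only ranks in [r_min, r_max] (1-based, inclusive) are used.
    """
    if r_max is None:
        r_max = len(freqs)
    r_max = min(r_max, len(freqs))
    ranks = range(r_min, r_max + 1)
    xs = [math.log10(r) for r in ranks]
    ys = [math.log10(freqs[r - 1]) for r in ranks]
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two ranks to fit a slope")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Synthetic data (an assumption, not a real corpus): 200 hypothetical
# word types whose counts follow f(r) = 1000 / r, rounded to integers.
tokens = []
for r in range(1, 201):
    tokens.extend([f"w{r}"] * round(1000 / r))

freqs = rank_frequency(tokens)
head = loglog_slope(freqs, 1, 100)    # first two decades of ranks
tail = loglog_slope(freqs, 100, 200)  # higher ranks, for comparison
print(f"slope over ranks 1-100:   {head:.3f}")
print(f"slope over ranks 100-200: {tail:.3f}")
```

On a real text, `tokens` would be the list of words after tokenization, either as inflected forms or after lemmatization, which is the distinction the abstract draws. Comparing the fitted slope over the first two decades with the slope at higher ranks is one simple way to look for the kind of break described above.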
