A data-based classification of Slavic languages: Indices of qualitative variation applied to grapheme frequencies
Michaela Koscová, Ján Macutek, Emmerich Kelih

TL;DR
This paper introduces a modified Ord's graph using indices of qualitative variation to analyze and compare grapheme frequency distributions across eleven Slavic languages, revealing meaningful linguistic relationships.
Contribution
It presents a novel modification of Ord's graph based on qualitative variation indices, enabling effective comparison of categorical linguistic data across languages.
Findings
Modified Ord's graph effectively visualizes linguistic similarities
Cluster analysis reveals meaningful relationships among Slavic languages
Original Ord's graph was less interpretable for categorical data
Abstract
The Ord's graph is a simple graphical method for displaying frequency distributions of data or theoretical distributions in the two-dimensional plane. Its coordinates are proportions of the first three moments, either empirical or theoretical ones. A modification of the Ord's graph based on proportions of indices of qualitative variation is presented. Such a modification makes the graph applicable also to data of categorical character. In addition, the indices are normalized with values between 0 and 1, which enables comparing data files divided into different numbers of categories. Both the original and the new graph are used to display grapheme frequencies in eleven Slavic languages. As the original Ord's graph requires an assignment of numbers to the categories, graphemes were ordered decreasingly according to their frequencies. Data were taken from parallel corpora, i.e., we work…
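As a rough illustration of the quantities the abstract mentions, the sketch below computes the classic Ord coordinates for a rank-frequency distribution, using the usual definitions I = m2/m1 and S = m3/m2 (m1 is the mean rank, m2 and m3 are the second and third central moments). It also computes two common normalized indices of qualitative variation. The choice of indices (normalized Gini-Simpson and normalized Shannon entropy) is an assumption made here for illustration. The excerpt does not say which indices the authors use or how they are combined into the modified graph's coordinates. The function names and the toy counts are likewise hypothetical.

```python
import math


def ord_coordinates(freqs):
    """Classic Ord coordinates (I, S) for frequencies ranked in decreasing order.

    Ranks 1..K are assigned to the categories after sorting by frequency,
    mirroring the ordering described in the abstract.
    """
    ranked = sorted(freqs, reverse=True)
    total = sum(ranked)
    probs = [f / total for f in ranked]
    ranks = range(1, len(probs) + 1)
    mean = sum(r * p for r, p in zip(ranks, probs))              # m1
    m2 = sum((r - mean) ** 2 * p for r, p in zip(ranks, probs))  # variance
    m3 = sum((r - mean) ** 3 * p for r, p in zip(ranks, probs))  # third central moment
    return m2 / mean, m3 / m2


def normalized_gini_simpson(freqs):
    """Assumed index: (1 - sum p_i^2) * K / (K - 1), which lies in [0, 1]."""
    k = len(freqs)
    total = sum(freqs)
    return (1 - sum((f / total) ** 2 for f in freqs)) * k / (k - 1)


def normalized_entropy(freqs):
    """Assumed index: Shannon entropy divided by ln K, which lies in [0, 1]."""
    k = len(freqs)
    total = sum(freqs)
    h = -sum((f / total) * math.log(f / total) for f in freqs if f > 0)
    return h / math.log(k)


# Toy grapheme counts (hypothetical, not taken from the paper's corpora).
toy_counts = [40, 30, 20, 10]
i_coord, s_coord = ord_coordinates(toy_counts)
print(i_coord, s_coord)
print(normalized_gini_simpson(toy_counts), normalized_entropy(toy_counts))
```

Because both indices are scaled by the number of categories K, they stay comparable across alphabets of different sizes, which is the property the abstract highlights. The Ord coordinates, by contrast, depend on the numeric ranks assigned to the graphemes.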
