A data-based classification of Slavic languages: Indices of qualitative variation applied to grapheme frequencies

Michaela Koscová; Ján Macutek; Emmerich Kelih

arXiv:1504.03608·stat.AP·April 15, 2015

TL;DR

This paper introduces a modified Ord's graph using indices of qualitative variation to analyze and compare grapheme frequency distributions across eleven Slavic languages, revealing meaningful linguistic relationships.

Contribution

It presents a novel modification of Ord's graph based on qualitative variation indices, enabling effective comparison of categorical linguistic data across languages.

Findings

01 Modified Ord's graph effectively visualizes linguistic similarities

02 Cluster analysis reveals meaningful relationships among Slavic languages

03 Original Ord's graph was less interpretable for categorical data

Abstract

The Ord's graph is a simple graphical method for displaying frequency distributions of data or theoretical distributions in the two-dimensional plane. Its coordinates are proportions of the first three moments, either empirical or theoretical ones. A modification of the Ord's graph based on proportions of indices of qualitative variation is presented. Such a modification makes the graph applicable also to data of categorical character. In addition, the indices are normalized with values between 0 and 1, which enables comparing data files divided into different numbers of categories. Both the original and the new graph are used to display grapheme frequencies in eleven Slavic languages. As the original Ord's graph requires an assignment of numbers to the categories, graphemes were ordered decreasingly according to their frequencies. Data were taken from parallel corpora, i.e., we work…
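
As a rough illustration of the two ideas in the abstract, the sketch below computes the classical Ord coordinates, I = m2/m1' and S = m3/m2, for frequencies ranked in decreasing order, and two normalized indices of qualitative variation. The function names are hypothetical, and the two indices shown (a normalized Gini-Simpson index and normalized Shannon entropy) are common textbook examples assumed here for illustration; the listing above does not say which indices or which proportions the authors actually use.

```python
import math


def ord_coordinates(freqs):
    """Classical Ord coordinates (I, S) for rank-ordered frequencies.

    Categories are numbered 1, 2, ... after sorting frequencies in
    decreasing order, as the abstract describes for graphemes.
    Assumes at least two categories with non-zero frequency.
    """
    ranked = sorted(freqs, reverse=True)
    total = sum(ranked)
    probs = [f / total for f in ranked]
    ranks = range(1, len(probs) + 1)
    mean = sum(r * p for r, p in zip(ranks, probs))            # first raw moment
    m2 = sum((r - mean) ** 2 * p for r, p in zip(ranks, probs))  # second central moment
    m3 = sum((r - mean) ** 3 * p for r, p in zip(ranks, probs))  # third central moment
    return m2 / mean, m3 / m2


def normalized_indices(freqs):
    """Two normalized indices of qualitative variation, each in [0, 1].

    Illustrative choices only (an assumption, not taken from the paper):
    normalized Gini-Simpson index and normalized Shannon entropy.
    Both need no numbering of the categories, and dividing by the
    maximum for k categories makes inventories of different sizes
    comparable.
    """
    total = sum(freqs)
    k = len(freqs)
    probs = [f / total for f in freqs if f > 0]
    gini = (1 - sum(p * p for p in probs)) * k / (k - 1)
    entropy = -sum(p * math.log(p) for p in probs) / math.log(k)
    return gini, entropy


# Toy frequencies for a four-symbol inventory (made-up numbers, not corpus data).
toy = [4, 3, 2, 1]
print(ord_coordinates(toy))
print(normalized_indices(toy))
```

Both indices equal 1 for a uniform distribution and 0 when a single category takes all the mass, regardless of the number of categories, which is the property the abstract relies on when comparing languages with different grapheme inventories.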
