Corpus Similarity Measures Remain Robust Across Diverse Languages
Haipeng Li, Jonathan Dunn

TL;DR
This study shows that frequency-based corpus similarity measures remain effective and robust across 39 diverse languages, including on low-resource and out-of-domain corpora, which supports their use for assessing how well corpus-based linguistic analyses generalize from one dataset to another.
Contribution
The paper provides empirical evidence that corpus similarity measures remain valid across diverse languages and domains, extending previous work that focused on Indo-European languages.
Findings
Measures are valid across different language families and writing systems.
Measures remain robust on out-of-domain and low-resource corpora.
Similarity measures effectively distinguish between different text registers.
Abstract
This paper experiments with frequency-based corpus similarity measures across 39 languages using a register prediction task. The goal is to quantify (i) the distance between different corpora from the same language and (ii) the homogeneity of individual corpora. Both of these goals are essential for measuring how well corpus-based linguistic analysis generalizes from one dataset to another. The problem is that previous work has focused on Indo-European languages, raising the question of whether these measures are able to provide robust generalizations across diverse languages. This paper uses a register prediction task to evaluate competing measures across 39 languages: how well are they able to distinguish between corpora representing different contexts of production? Each experiment compares three corpora from a single language, with the same three digital registers shared across all…
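The abstract is truncated here and does not name the specific measures it compares. As a rough illustration of what a frequency-based corpus similarity measure can look like, the sketch below ranks character n-gram frequencies in two corpora and compares the rankings with Spearman's rho. The choice of character trigrams, the `top_k` cutoff, and all function names are assumptions made for this example, not details taken from the paper.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Return the overlapping character n-grams of a lowercased text."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def frequency_profile(texts, n=3):
    """Count character n-grams over every text in a corpus."""
    counts = Counter()
    for text in texts:
        counts.update(char_ngrams(text, n))
    return counts


def average_ranks(values):
    """Rank values in ascending order, giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rho, computed as Pearson correlation of the ranks."""
    if not x:
        return 0.0
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return 0.0  # correlation is undefined when one side is constant
    return cov / math.sqrt(var_x * var_y)


def corpus_similarity(corpus_a, corpus_b, n=3, top_k=1000):
    """Compare two corpora on the top_k most frequent n-grams of their union.

    Assumption: the feature set is the pooled top_k n-grams; other
    feature-selection schemes are equally plausible.
    """
    freq_a = frequency_profile(corpus_a, n)
    freq_b = frequency_profile(corpus_b, n)
    features = [f for f, _ in (freq_a + freq_b).most_common(top_k)]
    return spearman([freq_a[f] for f in features],
                    [freq_b[f] for f in features])


# Toy corpora standing in for two registers (invented for illustration).
news = ["the minister said the budget will pass",
        "the court said the ruling will stand"]
tweets = ["lol cant wait for the weekend",
          "omg this is so good lol"]

same = corpus_similarity(news, news)
cross = corpus_similarity(news, tweets)
```

Under this sketch, goal (i) corresponds to calling `corpus_similarity` on two different corpora from the same language, and goal (ii) could be approximated by splitting one corpus into random halves and comparing the halves to each other. A register prediction setup would then check whether same-register pairs score higher than cross-register pairs.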
Taxonomy
Topics: Natural Language Processing Techniques · Authorship Attribution and Profiling · Topic Modeling
