Representations of Language Varieties Are Reliable Given Corpus Similarity Measures

Jonathan Dunn

arXiv:2104.01294·cs.CL·April 6, 2021·6 cites

TL;DR

This study evaluates whether digital geo-referenced corpora reliably represent local language varieties by analyzing similarity measures across multiple languages and sources, confirming their stability and consistency.

Contribution

It demonstrates that frequency-based corpus similarity measures reliably capture linguistic variation across diverse digital sources and language varieties.

Findings

01 High agreement between sources indicates reliable representation of language varieties.

02 Corpus similarity measures are stable across different languages and regions.

03 Digital corpora effectively model linguistic variation in geo-referenced data.

Abstract

This paper measures similarity both within and between 84 language varieties across nine languages. These corpora are drawn from digital sources (the web and tweets), allowing us to evaluate whether such geo-referenced corpora are reliable for modelling linguistic variation. The basic idea is that, if each source adequately represents a single underlying language variety, then the similarity between these sources should be stable across all languages and countries. The paper shows that there is a consistent agreement between these sources using frequency-based corpus similarity measures. This provides further evidence that digital geo-referenced corpora consistently represent local language varieties.
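
To make the idea of a frequency-based corpus similarity measure concrete, here is a minimal sketch. The abstract does not say which measure or features are used, so everything specific here is an assumption made for illustration: whitespace-tokenized word unigrams as features, Spearman's rank correlation as the measure, and the function names and `top_n` parameter, which are hypothetical. It is not the API of the linked repository.

```python
from collections import Counter


def relative_frequencies(text):
    """Lowercase, split on whitespace, and return word -> relative frequency."""
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}


def average_ranks(values):
    """Rank values from 1 (smallest) upward; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either input has no variance."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / (var_x * var_y) ** 0.5


def corpus_similarity(text_a, text_b, top_n=1000):
    """Spearman's rho over the top_n most frequent words of both corpora.

    Assumption: word unigrams and Spearman's rho stand in for whatever
    frequency-based measure the paper actually uses.
    """
    freq_a = relative_frequencies(text_a)
    freq_b = relative_frequencies(text_b)
    # Shared feature set: most frequent words across both corpora combined.
    vocabulary = set(freq_a) | set(freq_b)
    features = sorted(
        vocabulary,
        key=lambda w: (-(freq_a.get(w, 0.0) + freq_b.get(w, 0.0)), w),
    )[:top_n]
    xs = [freq_a.get(w, 0.0) for w in features]
    ys = [freq_b.get(w, 0.0) for w in features]
    # Spearman's rho is the Pearson correlation of the two rank vectors.
    return pearson(average_ranks(xs), average_ranks(ys))
```

Under these assumptions, the reliability check described in the abstract amounts to calling `corpus_similarity` on a web sample and a tweet sample from the same country, then asking whether the resulting scores behave consistently across countries and languages.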

Code & Models

Repositories

jonathandunn/corpus_similarity

Taxonomy

Topics: Linguistic Variation and Morphology · Natural Language Processing Techniques · Authorship Attribution and Profiling