Validating and Exploring Large Geographic Corpora
Jonathan Dunn

TL;DR
This study examines how different data cleaning methods affect the quality of large multilingual geographic web corpora, highlighting their uneven impact on under-represented languages and populations.
Contribution
It introduces a systematic evaluation of corpus creation steps and their effects on the validity of sub-corpora for specific languages and regions.
Findings
Data cleaning improves corpus validity at each stage.
Impact of cleaning is uneven across languages.
Standard techniques may exclude under-represented groups.
Abstract
This paper investigates the impact of corpus creation decisions on large multilingual geographic web corpora. Beginning with a 427-billion-word corpus derived from the Common Crawl, three methods are used to improve the quality of sub-corpora representing specific language-country pairs, such as New Zealand English: (i) the agreement of independent language identification systems, (ii) hash-based deduplication, and (iii) location-specific outlier detection. The impact of each of these steps is then evaluated at the language level and the country level by using corpus similarity measures to compare each resulting corpus with baseline data sets. The goal is to understand the impact of upstream data cleaning decisions on downstream corpora, with a specific focus on under-represented languages and populations. The evaluation shows that the validity of sub-corpora is improved with each stage of…
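To make the first two cleaning steps named in the abstract concrete, here is a minimal Python sketch of (i) requiring agreement between independent language identifiers and (ii) hash-based deduplication. It is an illustration under stated assumptions, not the paper's implementation: the function names, the whitespace and case normalization before hashing, and the two toy identifiers are all hypothetical stand-ins. Step (iii), location-specific outlier detection, is not sketched because the abstract does not describe how it works.

```python
import hashlib
from typing import Callable, Iterable, Iterator, List

# A language identifier maps a text to a language code (hypothetical interface).
Identifier = Callable[[str], str]


def agree_on_language(text: str, identifiers: List[Identifier], target: str) -> bool:
    """Keep a text only if every independent identifier predicts the target language."""
    return all(identify(text) == target for identify in identifiers)


def normalized_hash(text: str) -> str:
    """Hash a lowercased, whitespace-collapsed copy of the text.

    The normalization is an assumption made for this sketch; the abstract
    only says that deduplication is hash-based.
    """
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def clean(docs: Iterable[str], identifiers: List[Identifier], target: str) -> Iterator[str]:
    """Apply the language-agreement filter, then drop texts whose hash was already seen."""
    seen = set()
    for text in docs:
        if not agree_on_language(text, identifiers, target):
            continue
        digest = normalized_hash(text)
        if digest in seen:
            continue
        seen.add(digest)
        yield text


# Toy identifiers standing in for real, independently trained systems.
def toy_id_a(text: str) -> str:
    return "eng" if " the " in f" {text.lower()} " else "other"


def toy_id_b(text: str) -> str:
    return "eng" if text.isascii() else "other"


docs = [
    "The kiwi is a bird.",
    "the  kiwi is a bird.",   # duplicate after normalization
    "Te reo Māori",           # both toy identifiers reject it
    "El kiwi es the bird ñ",  # the toy identifiers disagree
]
kept = list(clean(docs, [toy_id_a, toy_id_b], "eng"))
```

The sketch also shows why such filters can affect some populations more than others: any text on which the identifiers disagree is discarded, so varieties that the identifiers handle poorly lose more data at this stage.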
Taxonomy
Topics: Web Data Mining and Analysis · Geographic Information Systems Studies · Text and Document Classification Technologies
