Comparing Measures of Linguistic Diversity Across Social Media Language Data and Census Data at Subnational Geographic Areas
Sidney G.-J. Wong, Jonathan Dunn, Benjamin Adams

TL;DR
This study compares linguistic diversity measures from social media and census data in New Zealand, exploring how online language reflects real-world populations and the potential for monitoring linguistic changes geographically.
Contribution
It provides a preliminary comparison of online and real-world linguistic diversity, highlighting the potential and challenges of using social media data for spatial linguistic analysis.
Findings
Social media language data shows potential for tracking linguistic diversity geographically.
Alignment between social media users and real-world populations is partial and requires further investigation.
Further research is needed to validate social media as a proxy for real-world linguistic behavior.
Abstract
This paper describes a preliminary study on the comparative linguistic ecology of online spaces (i.e., social media language data) and real-world spaces in Aotearoa New Zealand (i.e., subnational administrative areas). We compare measures of linguistic diversity between these different spaces and discuss how social media users align with real-world populations. The results from the current study suggest that there is potential to use online social media language data to observe spatial and temporal changes in linguistic diversity at subnational geographic areas; however, further work is required to understand how well social media represents real-world behaviour.
Taxonomy
Topics: Linguistic Variation and Morphology · Multilingual Education and Policy · Digital Communication and Language
Methods: ALIGN
