Detecting Linguistic Diversity on Social Media
Sidney Wong, Benjamin Adams, Jonathan Dunn

TL;DR
This study assesses whether social media data can be used to analyze linguistic diversity in New Zealand, comparing it with census data to reveal spatial and temporal language patterns at national, regional, and local levels.
Contribution
It demonstrates that social media language data can effectively reflect demographic and sociopolitical linguistic changes at multiple geographic scales.
Findings
Social media data aligns with census data on language use.
Social media captures regional linguistic variations.
Language dynamics are detectable through social media over time.
Abstract
This chapter explores the efficacy of using social media data to examine the changing linguistic behaviour of a place. We focus our investigation on Aotearoa New Zealand, where official statistics from the census are the only source of language use data. We use published census data as the ground truth and the social media sub-corpus from the Corpus of Global Language Use as our alternative data source. We use place as the common denominator between the two data sources. We identify the language conditions of each tweet in the social media data set and validate our results with two language identification models. We then compare levels of linguistic diversity at national, regional, and local geographies. The results suggest that social media language data has the potential to provide a rich source of spatial and temporal insights on the linguistic profile of a place. We show that social…
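The pipeline outlined in the abstract (label each tweet with a language, check the labels against a second model, then compare diversity across places) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract does not name the diversity measure or the language identification models, so Shannon entropy, the function names, and the toy records below are all assumptions made for the example.

```python
from collections import Counter, defaultdict
from math import log


def language_shares(records):
    """Turn (place, language_code) pairs into per-place language proportions."""
    counts = defaultdict(Counter)
    for place, lang in records:
        counts[place][lang] += 1
    shares = {}
    for place, counter in counts.items():
        total = sum(counter.values())
        shares[place] = {lang: n / total for lang, n in counter.items()}
    return shares


def shannon_diversity(shares):
    """Shannon entropy (natural log) of a language distribution.

    Assumed measure for illustration only; higher means more diverse.
    """
    return -sum(p * log(p) for p in shares.values() if p > 0)


def agreement_rate(labels_a, labels_b):
    """Fraction of items on which two language identification runs agree."""
    if len(labels_a) != len(labels_b):
        raise ValueError("label lists must have the same length")
    if not labels_a:
        return 0.0
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)


# Hypothetical toy data: (place, ISO 639-3 language code) per tweet.
toy_records = [
    ("Auckland", "eng"), ("Auckland", "eng"),
    ("Auckland", "mri"), ("Auckland", "smo"),
    ("Canterbury", "eng"), ("Canterbury", "eng"),
]

toy_shares = language_shares(toy_records)
toy_diversity = {place: shannon_diversity(s) for place, s in toy_shares.items()}
```

The same two functions could be applied to census counts for each place, so that the social media and census diversity scores are computed identically and can be compared at whichever geographic level the place labels represent.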
