Crowdsourcing Dialect Characterization through Twitter

Bruno Gonçalves; David Sánchez

arXiv:1407.7094·physics.soc-ph·November 20, 2014

TL;DR

This study analyzes Spanish dialects worldwide using geotagged Twitter data. Large-scale lexical clustering reveals two superdialects: an urban variety shared across major American and Spanish cities, and a diverse rural variety that splits into smaller regional clusters.

Contribution

It introduces a large-scale, data-driven approach to characterize Spanish dialects globally using Twitter, identifying macroregional dialectal patterns.

Findings

01

Spanish is divided into two superdialects: urban and rural.

02

Urban speech is widespread across major American and Spanish cities.

03

Rural dialects form smaller, regionally distinct clusters.

Abstract

We perform a large-scale analysis of language diatopic variation using geotagged microblogging datasets. By collecting all Twitter messages written in Spanish over more than two years, we build a corpus from which a carefully selected list of concepts allows us to characterize Spanish varieties on a global scale. A cluster analysis proves the existence of well-defined macroregions sharing common lexical properties. Remarkably enough, we find that the Spanish language is split into two superdialects, namely, an urban speech used across major American and Spanish cities and a diverse form that encompasses rural areas and small towns. The latter can be further clustered into smaller varieties with a stronger regional character.
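To make the idea of clustering regions by shared lexical properties concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the authors' pipeline: the abstract does not say which clustering algorithm, features, or spatial binning were used, so the sketch substitutes a plain k-means over per-cell word-variant shares. The cell names, the two concepts, and all counts in `COUNTS` are invented toy values, and `lexical_vectors` and `kmeans` are hypothetical helper names.

```python
# Hypothetical toy counts: (grid cell, concept, word variant) -> number of tweets.
# Cell names, concepts, and counts are invented for illustration only.
COUNTS = {
    ("cell_A", "car", "coche"): 9, ("cell_A", "car", "carro"): 1,
    ("cell_A", "computer", "ordenador"): 8, ("cell_A", "computer", "computadora"): 2,
    ("cell_B", "car", "coche"): 7, ("cell_B", "car", "carro"): 3,
    ("cell_B", "computer", "ordenador"): 9, ("cell_B", "computer", "computadora"): 1,
    ("cell_C", "car", "coche"): 1, ("cell_C", "car", "carro"): 9,
    ("cell_C", "computer", "ordenador"): 2, ("cell_C", "computer", "computadora"): 8,
    ("cell_D", "car", "coche"): 2, ("cell_D", "car", "carro"): 8,
    ("cell_D", "computer", "computadora"): 10,
}


def lexical_vectors(counts):
    """Map each cell to a vector of variant shares, one entry per (concept, variant)."""
    cells = sorted({cell for cell, _, _ in counts})
    features = sorted({(concept, variant) for _, concept, variant in counts})
    vectors = {}
    for cell in cells:
        # Total tweets per concept in this cell, used to normalise counts to shares.
        totals = {}
        for (ce, concept, _), n in counts.items():
            if ce == cell:
                totals[concept] = totals.get(concept, 0) + n
        vectors[cell] = [
            counts.get((cell, concept, variant), 0) / totals[concept]
            if totals.get(concept) else 0.0
            for concept, variant in features
        ]
    return vectors


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, iterations=20):
    """Lloyd's k-means with deterministic farthest-point initialisation."""
    cells = sorted(vectors)
    centroids = [vectors[cells[0]][:]]
    while len(centroids) < k:
        # Next centroid: the cell farthest from all centroids chosen so far.
        farthest = max(cells, key=lambda c: min(sq_dist(vectors[c], m) for m in centroids))
        centroids.append(vectors[farthest][:])
    labels = {}
    for _ in range(iterations):
        new_labels = {
            c: min(range(k), key=lambda i: sq_dist(vectors[c], centroids[i]))
            for c in cells
        }
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for i in range(k):
            members = [vectors[c] for c in cells if labels[c] == i]
            if members:
                centroids[i] = [sum(col) / len(members) for col in zip(*members)]
    return labels


labels = kmeans(lexical_vectors(COUNTS), k=2)
print(labels)
```

With these toy counts, `cell_A` and `cell_B` receive one label and `cell_C` and `cell_D` the other, because each pair prefers the same variants for both concepts. Normalising counts to shares per concept keeps heavily tweeting cells from dominating the distance, which is one plausible design choice among several.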
