Confounds and Consequences in Geotagged Twitter Data
Umashanthi Pavalanathan, Jacob Eisenstein

TL;DR
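This paper compares two sources of Twitter location data, GPS coordinates attached to messages and the user-supplied profile location field, and quantifies the biases each introduces. It shows that the two sources differ in age and gender composition, which affects linguistic analysis, and that text-based geolocation accuracy varies across age and gender groups.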
Contribution
It systematically analyzes biases in geotagged Twitter data and models demographic effects on language and geolocation performance.
Findings
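GPS-tagged and self-reported profile locations yield measurably different corpora.
These linguistic differences are partly attributable to differences in the age and gender composition of the two datasets.
Text-based geolocation accuracy varies with age and gender, with the best results for men over 40.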
Abstract
Twitter is often used in quantitative studies that identify geographically-preferred topics, writing styles, and entities. These studies rely on either GPS coordinates attached to individual messages, or on the user-supplied location field in each profile. In this paper, we compare these data acquisition techniques and quantify the biases that they introduce; we also measure their effects on linguistic analysis and text-based geolocation. GPS-tagging and self-reported locations yield measurably different corpora, and these linguistic differences are partially attributable to differences in dataset composition by age and gender. Using a latent variable model to induce age and gender, we show how these demographic variables interact with geography to affect language use. We also show that the accuracy of text-based geolocation varies with population demographics, giving the best results…
