Confounds and Consequences in Geotagged Twitter Data

Umashanthi Pavalanathan; Jacob Eisenstein

arXiv:1506.02275·cs.CL·August 25, 2015

TL;DR

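This paper compares two sources of location data on Twitter: GPS coordinates attached to messages and the self-reported location field in user profiles. The two sources yield measurably different corpora, partly because they differ in age and gender composition, and the accuracy of text-based geolocation also varies across age and gender groups.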

Contribution

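The paper quantifies the biases introduced by the two ways of acquiring geotagged Twitter data, and uses a latent variable model to induce age and gender and show how these variables affect language use and geolocation performance.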

Findings

01

GPS and profile locations produce different corpora.

02

Differences in age and gender composition partly explain the linguistic differences between the two corpora.

03

Text-based geolocation accuracy varies with age and gender, with the best results for men and for users over 40.

Abstract

Twitter is often used in quantitative studies that identify geographically-preferred topics, writing styles, and entities. These studies rely on either GPS coordinates attached to individual messages, or on the user-supplied location field in each profile. In this paper, we compare these data acquisition techniques and quantify the biases that they introduce; we also measure their effects on linguistic analysis and text-based geolocation. GPS-tagging and self-reported locations yield measurably different corpora, and these linguistic differences are partially attributable to differences in dataset composition by age and gender. Using a latent variable model to induce age and gender, we show how these demographic variables interact with geography to affect language use. We also show that the accuracy of text-based geolocation varies with population demographics, giving the best results…
