Race, Religion and the City: Twitter Word Frequency Patterns Reveal Dominant Demographic Dimensions in the United States
Eszter Bokányi, Dániel Kondor, László Dobos, Tamás Sebők, József Stéger, István Csabai, Gábor Vattay

TL;DR
This study analyzes over half a billion Twitter messages to uncover spatial language patterns in the US, revealing demographic traits like religion, ethnicity, and urbanization through unsupervised analysis.
Contribution
It introduces an unsupervised method combining LSA and RPCA to identify demographic-related language patterns in large-scale social media data.
Findings
Language patterns correlate with census demographic data
Patterns relate to slang, urbanization, religion, ethnicity
Demography is inherently reflected in OSN language use
Abstract
Recently, numerous approaches have emerged in the social sciences to exploit the opportunities made possible by the vast amounts of data generated by online social networks (OSNs). Having access to information about users on such a scale opens up a range of possibilities, all without the limitations associated with often slow and expensive paper-based polls. A question that remains to be satisfactorily addressed, however, is how demography is represented in OSN content. Here, we study language use in the US using a corpus of text compiled from over half a billion geo-tagged messages from the online microblogging platform Twitter. Our intention is to reveal the most important spatial patterns in language use in an unsupervised manner and relate them to demographics. Our approach is based on Latent Semantic Analysis (LSA) augmented with Robust Principal Component Analysis (RPCA)…
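As a rough illustration of the LSA step named in the abstract, the sketch below factors a small region-by-word count matrix with a singular value decomposition. It is a minimal sketch under stated assumptions, not the authors' pipeline: the function name `lsa_components`, the toy counts, the log transform, and the `1e-6` offset are all hypothetical choices made for this example, and the RPCA step is not shown.

```python
import numpy as np

def lsa_components(counts, k):
    """Return the top-k spatial patterns of a region-by-word count matrix.

    Hypothetical helper for illustration only; the preprocessing choices
    below are assumptions, not taken from the paper.
    """
    counts = np.asarray(counts, dtype=float)
    # Convert raw counts to per-region relative word frequencies.
    freqs = counts / counts.sum(axis=1, keepdims=True)
    # Dampen the heavy-tailed frequency distribution (offset is arbitrary).
    X = np.log(freqs + 1e-6)
    # Center each word column so components describe deviations
    # from the average region.
    X = X - X.mean(axis=0, keepdims=True)
    # Thin SVD: rows of Vt are word loadings, U * s are region scores.
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    region_scores = U[:, :k] * s[:k]
    word_loadings = Vt[:k]
    explained = s[:k] ** 2 / np.sum(s ** 2)
    return region_scores, word_loadings, explained

# Toy data: 4 regions (rows) x 4 words (columns).
# Regions 0-1 and regions 2-3 use the first three words differently.
counts = np.array([
    [30,  5,  2, 40],
    [28,  6,  3, 38],
    [ 4, 25, 20, 41],
    [ 5, 27, 18, 39],
])

scores, loadings, explained = lsa_components(counts, k=2)
print(scores[:, 0])   # first component separates the two groups of regions
print(explained)      # share of variance carried by each component
```

In this toy setting the first component assigns one sign to regions 0 and 1 and the opposite sign to regions 2 and 3. Relating such region scores to census variables would be a separate correlation step.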
Taxonomy
Topics: Human Mobility and Location-Based Analysis · Media Influence and Politics · Social Media and Politics
