Storywrangler: A massive exploratorium for sociolinguistic, cultural, socioeconomic, and political timelines using Twitter

Thayer Alshaabi, Jane L. Adams, Michael V. Arnold, Joshua R. Minot, David R. Dewhurst, Andrew J. Reagan, Christopher M. Danforth, and Peter Sheridan Dodds

arXiv:2007.12988 · cs.SI · July 20, 2021

TL;DR

Storywrangler is a tool that analyzes over a decade of Twitter data to track linguistic, cultural, and social trends in real time, supporting a range of sociolinguistic and sociopolitical research.

Contribution

The paper introduces a large-scale, real-time Twitter data curation system that tracks n-gram usage across multiple languages, offers interactive visualization, and can be extended to other social media platforms.

Findings

01

Over 100 billion tweets analyzed from 2008 to 2021

02

Provides interactive and downloadable time series data

03

Enables case studies linking social media trends to real-world events

Abstract

In real-time, social media data strongly imprints world events, popular culture, and day-to-day conversations by millions of ordinary people at a scale that is scarcely conventionalized and recorded. Vitally, and absent from many standard corpora such as books and news archives, sharing and commenting mechanisms are native to social media platforms, enabling us to quantify social amplification (i.e., popularity) of trending storylines and contemporary cultural phenomena. Here, we describe Storywrangler, a natural language processing instrument designed to carry out an ongoing, day-scale curation of over 100 billion tweets containing roughly 1 trillion 1-grams from 2008 to 2021. For each day, we break tweets into unigrams, bigrams, and trigrams spanning over 100 languages. We track n-gram usage frequencies, and generate Zipf distributions, for words, hashtags, handles, numerals, symbols,…
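The n-gram counting and rank-frequency step described in the abstract can be illustrated with a minimal sketch. This is not the Storywrangler implementation: the whitespace tokenizer, the function names (`tokenize`, `ngrams`, `count_ngrams`, `zipf_table`), and the three sample tweets are all assumptions made for illustration, and the real system handles hashtags, handles, symbols, and language detection that this sketch ignores.

```python
from collections import Counter

def tokenize(text):
    # Assumption: plain whitespace splitting stands in for a real tweet tokenizer.
    return text.split()

def ngrams(tokens, n):
    # Slide a window of length n over the token list.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def count_ngrams(tweets, n):
    # Tally every n-gram across one day's worth of tweets.
    counts = Counter()
    for tweet in tweets:
        counts.update(ngrams(tokenize(tweet), n))
    return counts

def zipf_table(counts):
    # Rank n-grams by descending count (ties broken alphabetically) and
    # attach each one's relative frequency, giving a rank-frequency table.
    total = sum(counts.values())
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(rank, gram, count, count / total)
            for rank, (gram, count) in enumerate(ordered, start=1)]

# Hypothetical sample "day" of tweets.
tweets = ["the cat sat", "the cat ran", "a dog ran"]

unigrams = count_ngrams(tweets, 1)
bigrams = count_ngrams(tweets, 2)
trigrams = count_ngrams(tweets, 3)

for row in zipf_table(unigrams):
    print(row)
```

Repeating this per day and per language would yield the kind of day-scale usage time series the abstract refers to, under the simplifying assumptions stated above.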
