The Remarkable Benefit of User-Level Aggregation for Lexical-based Population-Level Predictions

Salvatore Giorgi; Daniel Preotiuc-Pietro; Anneke Buffone; Daniel Rieman; Lyle H. Ungar; H. Andrew Schwartz

arXiv:1808.09600·cs.SI·August 30, 2018

TL;DR

This paper shows that aggregating Twitter language by user, rather than pooling all tweets, substantially improves U.S. county-level predictions of demographic, health, and psychological outcomes.

Contribution

The paper introduces a simple method that aggregates Twitter language by user before building community-level models, which improves the accuracy of community outcome predictions.

Findings

01

Median income prediction improved from Pearson r=.73 to r=.82 over the standard approach of aggregating all tweets.

02

Life satisfaction prediction improved from r=.37 to r=.47.

03

Released aggregated, anonymized community-level data derived from 37 billion tweets, over 1 billion of which were mapped to counties.

Abstract

Nowcasting based on social media text promises to provide unobtrusive and near real-time predictions of community-level outcomes. These outcomes are typically regarding people, but the data is often aggregated without regard to users in the Twitter populations of each community. This paper describes a simple yet effective method for building community-level models using Twitter language aggregated by user. Results on four different U.S. county-level tasks, spanning demographic, health, and psychological outcomes, show large and consistent improvements in prediction accuracies (e.g., from Pearson r=.73 to .82 for median income prediction or r=.37 to .47 for life satisfaction prediction) over the standard approach of aggregating all tweets. We make our aggregated and anonymized community-level data, derived from 37 billion tweets (over 1 billion of which were mapped to counties), available…
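The difference between the two aggregation strategies named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the whitespace tokenizer, the (county, user, text) input layout, and the unweighted mean over users are all assumptions made for illustration, and the abstract does not state details such as tokenization or user filtering.

```python
from collections import Counter, defaultdict

def relative_freqs(counts):
    """Turn raw word counts into relative frequencies (hypothetical helper)."""
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()} if total else {}

def tweet_level(tweets):
    """Baseline: pool every token from a county's tweets, then normalize once."""
    county_counts = defaultdict(Counter)
    for county, _user, text in tweets:
        county_counts[county].update(text.lower().split())
    return {c: relative_freqs(cnt) for c, cnt in county_counts.items()}

def user_level(tweets):
    """Normalize per user first, then average the users within each county."""
    user_counts = defaultdict(Counter)
    for county, user, text in tweets:
        user_counts[(county, user)].update(text.lower().split())

    county_sums = defaultdict(lambda: defaultdict(float))
    county_users = Counter()
    for (county, _user), cnt in user_counts.items():
        county_users[county] += 1
        for word, freq in relative_freqs(cnt).items():
            county_sums[county][word] += freq

    # Assumption: every user contributes equally to the county average.
    return {
        county: {w: s / county_users[county] for w, s in sums.items()}
        for county, sums in county_sums.items()
    }

# Toy data: one prolific user and one occasional user in the same county.
sample = [
    ("A", "u1", "coffee coffee coffee"),
    ("A", "u1", "coffee coffee coffee"),
    ("A", "u1", "coffee coffee coffee"),
    ("A", "u2", "tea"),
]
pooled = tweet_level(sample)
by_user = user_level(sample)
```

In the toy data, the prolific user dominates the pooled county features, while the user-level features weight both users equally. That is the intuition behind treating the county as a population of people rather than a bag of tweets.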
