Decay No More: A Persistent Twitter Dataset for Learning Social Meaning

Chiyu Zhang; Muhammad Abdul-Mageed; El Moatez Billah Nagoudi

arXiv:2204.04611·cs.CL·May 10, 2022

TL;DR

This paper introduces PTSM, a persistent Twitter dataset that uses paraphrases to address data decay, enabling more reliable social meaning research over time.

Contribution

The paper presents a novel persistent dataset for Twitter social meaning analysis that replaces original tweets with paraphrases to mitigate data decay issues.

Findings

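01. Models trained on paraphrased tweets perform close to models trained on the original tweets, with only a marginal performance loss.

02. PTSM comprises 17 social meaning datasets spanning 10 task categories.

03. Because the paraphrases remain available while the original tweets become inaccessible, PTSM is meant to alleviate the unfair comparisons and temporal bias that data decay causes in social media research.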

Abstract

With the proliferation of social media, many studies resort to social media to construct datasets for developing social meaning understanding systems. For the popular case of Twitter, most researchers distribute tweet IDs without the actual text contents due to the data distribution policy of the platform. One issue is that the posts become increasingly inaccessible over time, which leads to unfair comparisons and a temporal bias in social media research. To alleviate this challenge of data decay, we leverage a paraphrase model to propose a new persistent English Twitter dataset for social meaning (PTSM). PTSM consists of 17 social meaning datasets in 10 categories of tasks. We experiment with two SOTA pre-trained language models and show that our PTSM can substitute the actual tweets with paraphrases with marginal performance loss.
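The substitution idea in the abstract can be illustrated with a small sketch. This is not the authors' pipeline: the function names (`build_persistent_set`, `toy_paraphrase`), the rule of keeping the first paraphrase that differs from the source, and the stand-in paraphraser are all assumptions made for illustration. In practice the `paraphrase` callable would wrap a real paraphrase model.

```python
from typing import Callable, List, Sequence, Tuple

LabeledText = Tuple[str, str]  # (text, label)


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial copies can be detected."""
    return " ".join(text.lower().split())


def build_persistent_set(
    examples: Sequence[LabeledText],
    paraphrase: Callable[[str], List[str]],
) -> List[LabeledText]:
    """Replace each labeled tweet with a paraphrase that keeps its label.

    Assumption (hypothetical selection rule): keep the first candidate that
    is non-empty and not a copy of the source text. Tweets with no usable
    candidate are dropped.
    """
    persistent: List[LabeledText] = []
    for text, label in examples:
        for candidate in paraphrase(text):
            if candidate.strip() and normalize(candidate) != normalize(text):
                persistent.append((candidate.strip(), label))
                break
    return persistent


def toy_paraphrase(text: str) -> List[str]:
    """Stand-in for a real paraphrase model (hypothetical, rule-based)."""
    return [text, text.replace("love", "really like")]


if __name__ == "__main__":
    data = [("I love this phone", "positive"), ("meh", "neutral")]
    print(build_persistent_set(data, toy_paraphrase))
```

The labels travel unchanged with the paraphrases, so a classifier can be trained on the persistent set and compared against one trained on the original tweets, which is the comparison the abstract reports.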

Code & Models

Repositories

chiyuzhang94/ptsm (PyTorch, official)

Models

UBC-NLP/ptsm_t5_paraphraser (Hugging Face model, 4 downloads)

Taxonomy

Topics: Topic Modeling · Misinformation and Its Impacts · Sentiment Analysis and Opinion Mining