Decay No More: A Persistent Twitter Dataset for Learning Social Meaning
Chiyu Zhang, Muhammad Abdul-Mageed, El Moatez Billah Nagoudi

TL;DR
This paper introduces PTSM, a persistent Twitter dataset that uses paraphrases to address data decay, enabling more reliable social meaning research over time.
Contribution
The paper presents a novel persistent dataset for Twitter social meaning analysis that replaces original tweets with paraphrases to mitigate data decay issues.
Findings
Models trained on paraphrased tweets perform close to models trained on the original tweets, with only marginal performance loss.
PTSM includes 17 datasets across 10 social meaning categories.
Because the paraphrased data stays accessible over time, PTSM mitigates the unfair comparisons and temporal bias caused by tweets becoming unavailable.
Abstract
With the proliferation of social media, many studies resort to social media to construct datasets for developing social meaning understanding systems. For the popular case of Twitter, most researchers distribute tweet IDs without the actual text contents due to the data distribution policy of the platform. One issue is that the posts become increasingly inaccessible over time, which leads to unfair comparisons and a temporal bias in social media research. To alleviate this challenge of data decay, we leverage a paraphrase model to propose a new persistent English Twitter dataset for social meaning (PTSM). PTSM consists of 17 social meaning datasets in 10 categories of tasks. We experiment with two SOTA pre-trained language models and show that our PTSM can substitute the actual tweets with paraphrases with marginal performance loss.
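The substitution idea in the abstract can be illustrated with a minimal sketch: each labeled tweet is replaced by a paraphrase, and the gold label is carried over unchanged. Everything named here is hypothetical and for illustration only: `Example`, `build_persistent_split`, and `toy_paraphrase` are not from the paper, the string-replacement paraphraser stands in for a real paraphrase model, and the rule that drops verbatim copies is an assumption rather than the authors' documented filtering procedure.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Example:
    text: str   # tweet text (or its paraphrase)
    label: str  # gold social meaning label, e.g. a sentiment class


def build_persistent_split(
    examples: Iterable[Example],
    paraphrase: Callable[[str], str],
) -> List[Example]:
    """Replace each tweet with a paraphrase while keeping its gold label."""
    persistent = []
    for ex in examples:
        new_text = paraphrase(ex.text).strip()
        # Assumption: skip empty outputs and verbatim copies, since a copy
        # would redistribute the original tweet text.
        if not new_text or new_text.lower() == ex.text.strip().lower():
            continue
        persistent.append(Example(text=new_text, label=ex.label))
    return persistent


def toy_paraphrase(text: str) -> str:
    # Stand-in for a real paraphrase model (hypothetical, not the authors' model).
    return text.replace("love", "really like")


original = [Example("I love this phone", "positive"), Example("meh", "neutral")]
persistent = build_persistent_split(original, toy_paraphrase)
```

In this toy run the second example is dropped because the stand-in paraphraser returns it unchanged. A real pipeline would plug a trained paraphrase model into the `paraphrase` argument and would likely need a stronger check that the paraphrase preserves the label's meaning.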
Taxonomy
Topics: Topic Modeling · Misinformation and Its Impacts · Sentiment Analysis and Opinion Mining
