Distant Supervision for Topic Classification of Tweets in Curated Streams
Salman Mohammed, Nimesh Ghelani, and Jimmy Lin

TL;DR
This paper presents a novel method using distant supervision from curated streams to train topic classifiers for tweets, enabling effective and adaptive categorization in a noisy, real-time environment.
Contribution
It introduces a semi-automatic approach to generate labeled data from curated streams, improving topic classification of tweets with minimal manual labeling.
Findings
Classifiers trained with this method perform well when evaluated against both noisy labels and human ground-truth judgments.
The approach adapts to topic drift in Twitter news streams.
It enables dynamic, real-time topic classification of tweets.
Abstract
We tackle the challenge of topic classification of tweets in the context of analyzing a large collection of curated streams by news outlets and other organizations to deliver relevant content to users. Our approach is novel in applying distant supervision based on semi-automatically identifying curated streams that are topically focused (for example, on politics, entertainment, or sports). These streams provide a source of labeled data to train topic classifiers that can then be applied to categorize tweets from more topically-diffuse streams. Experiments on both noisy labels and human ground-truth judgments demonstrate that our approach yields good topic classifiers essentially "for free", and that topic classifiers trained in this manner are able to dynamically adjust for topic drift as news on Twitter evolves.
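The distant-supervision idea in the abstract can be illustrated with a minimal sketch: each tweet from a topically focused stream inherits that stream's topic as a noisy label, and a classifier trained on those labels is then applied to tweets from other streams. Everything below is an assumption made for illustration. The stream names, the example tweets, and the helper functions are hypothetical, and the small Naive Bayes model is a stand-in, since the abstract does not say which classifier the authors used.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy data: stream names and tweets are invented for illustration.
# Each focused stream maps to (topic label, tweets posted by that stream).
FOCUSED_STREAMS = {
    "sports_desk": ("sports", [
        "the striker scored a late goal to win the match",
        "coach confirms the team lineup for the final game",
    ]),
    "politics_desk": ("politics", [
        "senate vote on the budget bill delayed again",
        "the president will address parliament after the election",
    ]),
}


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def build_training_set(streams):
    """Distant supervision: every tweet inherits the topic of its stream."""
    return [
        (tokenize(tweet), topic)
        for topic, tweets in streams.values()
        for tweet in tweets
    ]


def train_naive_bayes(examples):
    """Collect per-topic word counts and document counts."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    vocab = set()
    for tokens, label in examples:
        doc_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return word_counts, doc_counts, vocab


def classify(tokens, model):
    """Return the topic with the highest add-one-smoothed log probability."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, float("-inf")
    for label in doc_counts:
        score = math.log(doc_counts[label] / total_docs)
        denom = sum(word_counts[label].values()) + len(vocab)
        for tok in tokens:
            if tok in vocab:  # words never seen in training are skipped
                score += math.log((word_counts[label][tok] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Train on the focused streams, then label a tweet from a diffuse stream.
model = train_naive_bayes(build_training_set(FOCUSED_STREAMS))
prediction = classify(tokenize("late goal decides the game"), model)
```

No tweet is labeled by hand in this sketch; the only manual step is deciding which streams count as topically focused, which corresponds to the semi-automatic step the abstract describes.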
Taxonomy
Topics: Misinformation and Its Impacts · Spam and Phishing Detection · Topic Modeling
