TL;DR
This paper explores real-time, country-level classification of tweets from around the world using tweet-inherent features. It shows that combining content and metadata improves accuracy, and that models trained on historical tweets can still classify tweets collected a year later.
Contribution
It classifies global tweets by country of origin using eight tweet-inherent features, and evaluates how well trained models hold up over time.
Findings
Combining tweet content and metadata improves classification accuracy by 20-50%.
Content, self-reported location, and real name are highly useful features.
Models trained on historical data can classify new tweets effectively without retraining.
Abstract
In contrast to much previous work that has focused on location classification of tweets restricted to a specific country, here we undertake the task in a broader context by classifying global tweets at the country level, which is so far unexplored in a real-time scenario. We analyse the extent to which a tweet's country of origin can be determined by making use of eight tweet-inherent features for classification. Furthermore, we use two datasets, collected a year apart from each other, to analyse the extent to which a model trained from historical tweets can still be leveraged for classification of new tweets. With classification experiments on all 217 countries in our datasets, as well as on the top 25 countries, we offer some insights into the best use of tweet-inherent features for an accurate country-level classification of tweets. We find that the use of a single feature, such as…
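The idea of combining tweet content with metadata can be illustrated with a minimal sketch. This is not the authors' implementation: the classifier here is a small multinomial Naive Bayes written for illustration, the field names (`content`, `user_location`, `user_name`) are hypothetical stand-ins for three of the features named in the findings, and the training tweets are invented toy data.

```python
import math
from collections import Counter, defaultdict

# Hypothetical field names standing in for three tweet-inherent features.
FIELDS = ("content", "user_location", "user_name")


def featurize(tweet, fields=FIELDS):
    """Turn the selected fields into tokens prefixed with their field name,
    so that 'london' in the location field differs from 'london' in content."""
    tokens = []
    for field in fields:
        for word in tweet.get(field, "").lower().split():
            tokens.append(f"{field}={word}")
    return tokens


class NaiveBayes:
    """Toy multinomial Naive Bayes with add-one smoothing (illustrative only)."""

    def __init__(self, fields=FIELDS):
        self.fields = fields
        self.class_counts = Counter()
        self.token_counts = defaultdict(Counter)
        self.vocab = set()

    def fit(self, tweets, countries):
        for tweet, country in zip(tweets, countries):
            self.class_counts[country] += 1
            for tok in featurize(tweet, self.fields):
                self.token_counts[country][tok] += 1
                self.vocab.add(tok)
        return self

    def predict(self, tweet):
        total = sum(self.class_counts.values())
        vocab_size = len(self.vocab)
        best, best_score = None, -math.inf
        for country, n in self.class_counts.items():
            score = math.log(n / total)
            denom = sum(self.token_counts[country].values()) + vocab_size
            for tok in featurize(tweet, self.fields):
                if tok in self.vocab:  # tokens never seen in training are skipped
                    score += math.log((self.token_counts[country][tok] + 1) / denom)
            if score > best_score:
                best, best_score = country, score
        return best


# Invented toy data: the "historical" training tweets.
train = [
    {"content": "great match at wembley tonight", "user_location": "london uk", "user_name": "james smith"},
    {"content": "rainy commute again", "user_location": "manchester", "user_name": "emma jones"},
    {"content": "paella with friends", "user_location": "madrid", "user_name": "carlos garcia"},
    {"content": "beach day at last", "user_location": "valencia spain", "user_name": "lucia garcia"},
]
labels = ["GB", "GB", "ES", "ES"]

# One model sees only the tweet text; the other also sees the metadata fields.
content_only = NaiveBayes(fields=("content",)).fit(train, labels)
combined = NaiveBayes(fields=FIELDS).fit(train, labels)

# An invented "new" tweet whose text alone is misleading.
new_tweet = {"content": "beach day tonight", "user_location": "london", "user_name": "oliver smith"}
print(content_only.predict(new_tweet), combined.predict(new_tweet))
```

On this toy example the two models disagree about the new tweet, because the location and name fields carry evidence that the text alone does not. The same `fit` then `predict` pattern also mirrors the temporal setup described in the abstract: fit on the older dataset, then predict on tweets collected later without refitting.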
