TL;DR
This paper introduces a new dataset of 3144 tobacco-related tweets with fine-grained labels, enabling detailed classification of tobacco mentions, topics, and demographics for improved research and surveillance.
Contribution
The creation of a labeled Twitter dataset with hierarchical annotations for fine-grained tobacco-related classification and the demonstration of standard text classification methods on it.
Findings
Standard classifiers perform effectively on the dataset
Hierarchical classification enables detailed topic and demographic analysis
Dataset facilitates future sentiment and style analysis in tobacco research
Abstract
Contemporary datasets on tobacco consumption focus on one of two topics: public health mentions and disease surveillance, or sentiment analysis of topical tobacco products and services. However, two primary considerations are not accounted for: the language of the affected demographic, and a combination of the topics mentioned above in a fine-grained classification mechanism. In this paper, we create a dataset of 3144 tweets, selected based on the presence of colloquial slang related to smoking, and analyze them based on their semantics. Each class is created and annotated based on the content of the tweets so that further hierarchical methods can be easily applied. Further, we demonstrate the efficacy of standard text classification methods on this dataset by designing experiments that perform both binary and multi-class classification. Our experiments tackle…
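The binary-then-multi-class setup described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the toy tweets, the label names (`cigarette`, `vape`, `not_tobacco`), and the choice of a small Naive Bayes classifier are hypothetical and are not taken from the paper's dataset, label scheme, or experiments.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial Naive Bayes over whitespace tokens with add-one smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        self.vocab = set()
        for text, label in zip(texts, labels):
            tokens = text.lower().split()
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, text):
        tokens = text.lower().split()
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            # Log prior plus smoothed log likelihood of each token.
            score = math.log(n_docs / total_docs)
            for tok in tokens:
                score += math.log((counts[tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented toy examples: (text, is_tobacco_related, hypothetical topic label).
TRAIN = [
    ("need a cig break right now", True, "cigarette"),
    ("smoking a cig outside the bar", True, "cigarette"),
    ("my vape juice ran out again", True, "vape"),
    ("new vape pen clouds are huge", True, "vape"),
    ("traffic was terrible this morning", False, None),
    ("great game last night", False, None),
]

# Stage 1: binary classifier trained on every tweet.
binary_clf = NaiveBayes().fit(
    [text for text, _, _ in TRAIN],
    [related for _, related, _ in TRAIN],
)

# Stage 2: multi-class classifier trained only on tobacco-related tweets.
related_rows = [(text, topic) for text, related, topic in TRAIN if related]
topic_clf = NaiveBayes().fit(
    [text for text, _ in related_rows],
    [topic for _, topic in related_rows],
)


def classify(text):
    """Apply the binary stage first, then the topic stage if it passes."""
    if not binary_clf.predict(text):
        return "not_tobacco"
    return topic_clf.predict(text)


if __name__ == "__main__":
    for tweet in ["out of vape juice", "smoking a cig", "traffic this morning was awful"]:
        print(tweet, "->", classify(tweet))
```

Splitting the decision into two stages mirrors the idea that classes are annotated so that hierarchical methods can be layered on top: a coarse filter first, then a finer label among the tweets that pass it.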
