TL;DR
This paper investigates the impact of training data quality, especially human annotation agreement, on multilingual Twitter sentiment classification, showing that larger, high-quality datasets lead to models approaching human agreement levels.
Contribution
It demonstrates that data quality and size are more critical than model type, and highlights the importance of monitoring annotator agreement for better sentiment classification.
Findings
Model performance approaches inter-annotator agreement when the training set is sufficiently large.
Training data quality significantly influences classification accuracy.
Humans perceive sentiment classes as ordered.
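As an illustration of the findings above, the sketch below computes agreement between two annotators in two ways: once treating the sentiment classes as unordered, and once treating them as ordered, so that a negative/positive confusion counts more than a negative/neutral one. The summary does not say which agreement measures the paper uses. Cohen's kappa with optional linear weights is chosen here only as one common option, and the label set, function name, and example annotations are all hypothetical.

```python
from collections import Counter

# Assumed ordered label set; the summary does not list the paper's labels.
LABELS = ["negative", "neutral", "positive"]


def cohen_kappa(a, b, labels=LABELS, weighted=False):
    """Cohen's kappa for two equally long label sequences.

    With weighted=True, disagreements are penalized linearly by the
    distance between class positions, i.e. classes are treated as ordered.
    """
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty sequences of equal length")

    k = len(labels)
    idx = {lab: i for i, lab in enumerate(labels)}
    n = len(a)

    def penalty(i, j):
        # Disagreement weight: 0 on the diagonal, up to 1 for the extremes.
        if weighted:
            return abs(i - j) / (k - 1)
        return 0.0 if i == j else 1.0

    pairs = Counter((idx[x], idx[y]) for x, y in zip(a, b))
    marg_a = Counter(idx[x] for x in a)
    marg_b = Counter(idx[y] for y in b)

    # Observed disagreement, and the disagreement expected by chance
    # from the two annotators' marginal label distributions.
    observed = sum(penalty(i, j) * c for (i, j), c in pairs.items()) / n
    expected = sum(
        penalty(i, j) * marg_a[i] * marg_b[j]
        for i in range(k)
        for j in range(k)
    ) / (n * n)

    if expected == 0:
        return 1.0  # both annotators used one identical label throughout
    return 1.0 - observed / expected


# Hypothetical annotations of the same four tweets by two annotators.
annotator_1 = ["negative", "neutral", "positive", "positive"]
annotator_2 = ["negative", "neutral", "positive", "negative"]

unordered = cohen_kappa(annotator_1, annotator_2)
ordered = cohen_kappa(annotator_1, annotator_2, weighted=True)
```

Passing the same annotator's labels from two passes over the same tweets would give a self-agreement score in the same way. The weighted variant is relevant only if the classes are taken to be ordered, as the last finding suggests.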
Abstract
What are the limits of automated Twitter sentiment classification? We analyze a large set of manually labeled tweets in different languages, use them as training data, and construct automated classification models. It turns out that the quality of classification models depends much more on the quality and size of training data than on the type of the model trained. Experimental results indicate that there is no statistically significant difference between the performance of the top classification models. We quantify the quality of training data by applying various annotator agreement measures, and identify the weakest points of different datasets. We show that the model performance approaches the inter-annotator agreement when the size of the training set is sufficiently large. However, it is crucial to regularly monitor the self- and inter-annotator agreements since this improves the…
