Semi-automatic Generation of Multilingual Datasets for Stance Detection in Twitter
Elena Zotova, Rodrigo Agerri, German Rigau

TL;DR
This paper introduces a semi-automatic method to generate large, balanced, and multilingual stance detection datasets for Twitter by leveraging user-based information, addressing the scarcity of resources in multiple languages.
Contribution
The paper presents a novel semi-automatic approach that reduces manual annotation effort for multilingual stance detection datasets in social media.
Findings
Method effectively creates large, balanced multilingual datasets
Empirical results demonstrate improved data quality for stance detection
Qualitative analysis confirms the method's adaptability to other NLP tasks
Abstract
Popular social media networks provide the perfect environment to study the opinions and attitudes expressed by users. While interactions in social media such as Twitter occur in many natural languages, research on stance detection (the position or attitude expressed with respect to a specific topic) within the Natural Language Processing field has largely been done for English. Although some efforts have recently been made to develop annotated data in other languages, there is a telling lack of resources to facilitate multilingual and crosslingual research on stance detection. This is partially due to the fact that manually annotating a corpus of social media texts is a difficult, slow and costly process. Furthermore, as stance is a highly domain- and topic-specific phenomenon, the need for annotated data is especially demanding. As a result, most of the manually labeled resources are…
