Enhancing Data Quality through Simple De-duplication: Navigating Responsible Computational Social Science Research
Yida Mu, Mali Jin, Xingyi Song, Nikolaos Aletras

TL;DR
This paper analyzes social media datasets used in NLP for CSS, revealing duplication issues that affect data quality and model performance, and proposes protocols to improve dataset development and usage.
Contribution
It provides a comprehensive examination of data duplication in social media datasets for NLP in CSS and offers new protocols to enhance data quality and reliability.
Findings
Social media datasets exhibit varying levels of data duplication.
Duplication causes label inconsistencies and data leakage, and may inflate claimed state-of-the-art performance.
Proposed best practices improve dataset reliability.
Abstract
Research in natural language processing (NLP) for Computational Social Science (CSS) heavily relies on data from social media platforms. This data plays a crucial role in the development of models for analysing socio-linguistic phenomena within online communities. In this work, we conduct an in-depth examination of 20 datasets extensively used in NLP for CSS to comprehensively examine data quality. Our analysis reveals that social media datasets exhibit varying levels of data duplication. Consequently, this gives rise to challenges like label inconsistencies and data leakage, compromising the reliability of models. Our findings also suggest that data duplication has an impact on the current claims of state-of-the-art performance, potentially leading to an overestimation of model effectiveness in real-world scenarios. Finally, we propose new protocols and best practices for improving…
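The three problems named in the abstract (duplication, label inconsistencies, and data leakage) can be illustrated with a small audit over a train/test split. This is a minimal sketch, not the authors' protocol: the normalisation rules (lowercasing, masking URLs and @mentions), the function names `normalize` and `audit_and_dedup`, and the choice to drop texts with conflicting labels are all assumptions made for illustration.

```python
import re
from collections import defaultdict


def normalize(text):
    """Assumed normalisation: lowercase, mask URLs and @mentions, collapse whitespace."""
    text = text.lower()
    text = re.sub(r"https?://\S+", "<url>", text)
    text = re.sub(r"@\w+", "<user>", text)
    return re.sub(r"\s+", " ", text).strip()


def audit_and_dedup(train, test):
    """Audit (text, label) pairs for duplicates, label conflicts, and leakage.

    Returns (conflicting_keys, leaked_test_items, clean_train, clean_test).
    """
    # Collect every label seen for each normalised text across both splits.
    labels_by_key = defaultdict(set)
    for text, label in train + test:
        labels_by_key[normalize(text)].add(label)
    # Label inconsistency: the same normalised text carries more than one label.
    conflicting = {k for k, labels in labels_by_key.items() if len(labels) > 1}

    # Data leakage: a test item whose normalised text also appears in train.
    train_keys = {normalize(text) for text, _ in train}
    leaked = [(t, y) for t, y in test if normalize(t) in train_keys]

    # De-duplicate: keep the first occurrence (train first, then test) and
    # drop texts with conflicting labels (one possible policy, assumed here).
    seen = set()
    clean_train, clean_test = [], []
    for split, out in ((train, clean_train), (test, clean_test)):
        for text, label in split:
            key = normalize(text)
            if key in seen or key in conflicting:
                continue
            seen.add(key)
            out.append((text, label))
    return conflicting, leaked, clean_train, clean_test
```

Processing the training split first means a text that occurs in both splits stays in training and is removed from test, so the cleaned test set contains only unseen items. Exact matching after normalisation will miss near-duplicates such as lightly edited reposts; catching those would need a fuzzy similarity measure, which is outside this sketch.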
Taxonomy
Topics: Data Quality and Management · Big Data Technologies and Applications
