Garbage In, Garbage Out? Do Machine Learning Application Papers in Social Computing Report Where Human-Labeled Training Data Comes From?
R. Stuart Geiger, Kevin Yu, Yanlai Yang, Mindy Dai, Jie Qiu, Rebekah Tang, Jenny Huang

TL;DR
This study examines how thoroughly machine learning papers in social computing report their data labeling processes, and finds that documentation of best practices for data quality and reliability is highly inconsistent.
Contribution
The paper provides a systematic analysis of how ML application papers in social computing report their data labeling, highlighting gaps and inconsistencies in the documentation of data quality procedures.
Findings
Wide divergence in reporting labeling practices
Many papers lack details on labeler qualifications and reliability metrics
Discrepancies in disclosure of training data availability
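To make "reliability metrics" concrete, here is a minimal sketch of one commonly used inter-rater agreement statistic, Cohen's kappa, for two labelers who each labeled the same items. The choice of this metric, the function name `cohens_kappa`, and the sample labels are illustrative assumptions; this summary does not say which metrics the surveyed papers reported.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two labelers over the same items (hypothetical helper).

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    and p_e is the agreement expected by chance from each labeler's
    own label frequencies.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both labelers must label the same non-empty set of items.")

    n = len(labels_a)

    # Observed agreement: fraction of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: for each category, multiply the two labelers'
    # marginal proportions, then sum over categories.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    categories = counts_a.keys() | counts_b.keys()
    expected = sum(counts_a[c] * counts_b[c] for c in categories) / (n * n)

    # Degenerate case (assumed convention): both labelers used one
    # identical category for every item, so kappa is undefined; return 1.0.
    if expected == 1.0:
        return 1.0

    return (observed - expected) / (1 - expected)


# Made-up example labels for six items.
labeler_1 = ["spam", "spam", "ham", "ham", "spam", "ham"]
labeler_2 = ["spam", "ham", "ham", "ham", "spam", "ham"]
kappa = cohens_kappa(labeler_1, labeler_2)
print(f"Cohen's kappa: {kappa:.3f}")
```

A paper that reports such a value alongside the number of labelers and the number of items each one labeled gives readers a basis for judging how consistent the training labels are.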
Abstract
Many machine learning projects for new application areas involve teams of humans who label data for a particular purpose, from hiring crowdworkers to the paper's authors labeling the data themselves. Such a task is quite similar to (or a form of) structured content analysis, which is a longstanding methodology in the social sciences and humanities, with many established best practices. In this paper, we investigate to what extent a sample of machine learning application papers in social computing --- specifically papers from ArXiv and traditional publications performing an ML classification task on Twitter data --- give specific details about whether such best practices were followed. Our team conducted multiple rounds of structured content analysis of each paper, making determinations such as: Does the paper report who the labelers were, what their qualifications were, whether they…
Taxonomy
Topics: Topic Modeling · Software Engineering Research · Explainable Artificial Intelligence (XAI)
