The Landscape of Toxicity: An Empirical Investigation of Toxicity on GitHub
Jaydeb Sarker, Asif Kamal Turzo, Amiangshu Bosu

TL;DR
This study presents an empirical analysis of toxicity across 2,828 GitHub OSS projects, combining automated detection with manual validation to identify the factors associated with how prevalent toxicity is and what form it takes.
Contribution
It introduces a large-scale mixed-method approach combining automated toxicity detection with manual analysis to understand toxicity patterns in GitHub OSS communities.
Findings
Profanity is the most common toxicity type.
Higher project popularity correlates with increased toxicity.
Toxic contributors are more likely to be targeted and to repeat toxic behavior.
Abstract
Toxicity on GitHub can severely impact Open Source Software (OSS) development communities. To mitigate such behavior, a better understanding of its nature and how various measurable characteristics of project contexts and participants are associated with its prevalence is necessary. To achieve this goal, we conducted a large-scale mixed-method empirical study of 2,828 GitHub-based OSS projects randomly selected based on a stratified sampling strategy. Using ToxiCR, an SE domain-specific toxicity detector, we automatically classified each comment as toxic or non-toxic. Additionally, we manually analyzed a random sample of 600 comments to validate ToxiCR's performance and gain insights into the nature of toxicity within our dataset. The results of our study suggest that profanity is the most frequent toxicity on GitHub, followed by trolling and insults. While a project's popularity is…
Taxonomy
Topics: Advanced Malware Detection Techniques · Software Engineering Research · Digital and Cyber Forensics
