A Benchmark Study of the Contemporary Toxicity Detectors on Software Engineering Interactions
Jaydeb Sarker, Asif Kamal Turzo, Amiangshu Bosu

TL;DR
This study evaluates the effectiveness of existing toxicity detection tools on large-scale software engineering communication datasets, revealing significant performance degradation and proposing improvements for SE-specific toxicity detection.
Contribution
It provides a comprehensive empirical evaluation of toxicity detectors on SE data, introduces a manual labeling rubric, and offers recommendations for SE-specific toxicity detection improvements.
Findings
All tools' performance degraded significantly on SE datasets.
Performance was worse on formal communication like code reviews.
Retraining improved some models' accuracy.
Abstract
Automated filtering of toxic conversations may help an open-source software (OSS) community maintain healthy interactions among project participants. Although several general-purpose tools exist to identify toxic content, they may incorrectly flag some words commonly used in the Software Engineering (SE) context as toxic (e.g., 'junk', 'kill', and 'dump') and vice versa. To counter this challenge, an SE-specific tool has been proposed by the CMU Strudel Lab (referred to as 'STRUDEL' hereinafter) that combines the output of the Perspective API with the output of a customized version of Stanford's Politeness detector tool. However, since STRUDEL's evaluation was very limited, covering only 654 SE texts, its practical applicability is unclear. Therefore, this study aims to empirically evaluate the STRUDEL tool as well as four state-of-the-art general-purpose toxicity detectors…
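The false-positive problem described in the abstract can be illustrated with a toy example. The sketch below is not STRUDEL, the Perspective API, or any tool evaluated in the study. It is a deliberately naive keyword flagger, and the lexicon, the sample comments, and their labels are all made up for illustration.

```python
import re

# Hypothetical lexicon: words a general-purpose detector might treat as toxic.
# Chosen for illustration only; not taken from any evaluated tool.
GENERIC_TOXIC_WORDS = {"junk", "kill", "dump", "idiot"}


def naive_flag(text: str) -> bool:
    """Flag text as toxic if any token appears in the lexicon (toy baseline)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return any(tok in GENERIC_TOXIC_WORDS for tok in tokens)


# Made-up SE comments with made-up labels (True = toxic).
samples = [
    ("Please kill the stale process before running the tests.", False),
    ("Take a heap dump and attach it to the ticket.", False),
    ("You are an idiot, stop submitting patches.", True),
]

# Comments the toy flagger marks as toxic although the label says otherwise.
false_positives = [
    text for text, is_toxic in samples if naive_flag(text) and not is_toxic
]

for text in false_positives:
    print("false positive:", text)
```

Here the two benign comments are flagged only because 'kill' and 'dump' carry a technical meaning that the flagger cannot see, which is the kind of domain mismatch that motivates SE-specific detection.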
