TL;DR
This paper introduces a large, manually verified dataset of GitHub comments to train a highly accurate bot detection model, and provides an open-source tool for practitioners to identify bots in repositories.
Contribution
The paper presents the first large ground-truth dataset of GitHub comments for bot detection, together with a novel, high-accuracy classification model that is integrated into an open-source tool.
Findings
Achieved 0.98 precision, recall, and F1-score in bot detection
Developed a dataset with 5,000 GitHub accounts including 527 bots
Created an open-source command-line tool for bot detection
Abstract
Bots are frequently used in GitHub repositories to automate repetitive activities that are part of the distributed software development process. They communicate with human actors through comments. While detecting their presence is important for many reasons, no large and representative ground-truth dataset is available, nor are classification models to detect and validate bots on the basis of such a dataset. This paper proposes a ground-truth dataset, based on a manual analysis with high interrater agreement, of pull request and issue comments in 5,000 distinct GitHub accounts of which 527 have been identified as bots. Using this dataset we propose an automated classification model to detect bots, taking as main features the number of empty and non-empty comments of each account, the number of comment patterns, and the inequality between comments within comment patterns. We obtained a…
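To make the four features named in the abstract concrete, here is a minimal Python sketch of how they could be computed for one account. The abstract does not say how comment patterns are detected or how inequality is measured, so two choices below are assumptions made purely for illustration: comments are grouped into patterns by a crude text normalization (the hypothetical `normalize` helper), and inequality is measured with a Gini coefficient over pattern sizes (the hypothetical `gini` helper). Neither should be read as the authors' method.

```python
import re
from collections import Counter


def normalize(comment: str) -> str:
    """Hypothetical pattern key (assumption): mask URLs and numbers and
    collapse whitespace so templated comments map to the same string."""
    text = comment.strip().lower()
    text = re.sub(r"https?://\S+", "<url>", text)
    text = re.sub(r"\d+", "<num>", text)
    return re.sub(r"\s+", " ", text)


def gini(counts) -> float:
    """Gini coefficient of non-negative counts (assumed inequality measure).
    0.0 means every pattern has the same number of comments."""
    values = sorted(counts)
    n = len(values)
    total = sum(values)
    if n == 0 or total == 0:
        return 0.0
    # Standard closed form for ascending-sorted values with 1-based ranks.
    weighted = sum((i + 1) * v for i, v in enumerate(values))
    return (2 * weighted) / (n * total) - (n + 1) / n


def account_features(comments: list[str]) -> dict:
    """Compute the four per-account features listed in the abstract."""
    non_empty = [c for c in comments if c.strip()]
    patterns = Counter(normalize(c) for c in non_empty)
    return {
        "empty_comments": len(comments) - len(non_empty),
        "non_empty_comments": len(non_empty),
        "comment_patterns": len(patterns),
        "pattern_inequality": gini(patterns.values()),
    }


if __name__ == "__main__":
    sample = ["Build 12 passed", "Build 13 passed", "Build 14 failed", ""]
    print(account_features(sample))
```

The intuition behind such features is that an account posting many comments that collapse into very few patterns looks more automated than one whose comments are all different. A feature dictionary like this could then be passed to any standard classifier; which classifier the paper uses is not stated in the excerpt above.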
