A ground-truth dataset and classification model for detecting bots in GitHub issue and PR comments

Mehdi Golzadeh; Alexandre Decan; Damien Legay; Tom Mens

arXiv:2010.03303·cs.SE·January 29, 2021


TL;DR

This paper introduces a large, manually verified dataset of GitHub comments to train a highly accurate bot detection model, and provides an open-source tool for practitioners to identify bots in repositories.

Contribution

It presents the first large ground-truth dataset of GitHub comments for bot detection and a novel classification model with high accuracy, integrated into an open-source tool.

Findings

01

Achieved 0.98 precision, recall, and F1-score in bot detection

02

Developed a dataset with 5,000 GitHub accounts including 527 bots

03

Created an open-source command-line tool for bot detection

Abstract

Bots are frequently used in GitHub repositories to automate repetitive activities that are part of the distributed software development process. They communicate with human actors through comments. While detecting their presence is important for many reasons, no large and representative ground-truth dataset is available, nor are there classification models to detect and validate bots on the basis of such a dataset. This paper proposes a ground-truth dataset, based on a manual analysis with high inter-rater agreement, of pull request and issue comments in 5,000 distinct GitHub accounts, of which 527 have been identified as bots. Using this dataset we propose an automated classification model to detect bots, taking as main features the number of empty and non-empty comments of each account, the number of comment patterns, and the inequality between comments within comment patterns. We obtained a…
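To make the four features named in the abstract concrete, here is a minimal Python sketch of how they could be computed for one account. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not say how comments are grouped into patterns or which inequality measure is used. The sketch assumes a hypothetical `normalize` helper that treats comments as the same pattern when they match after lowercasing and masking digits, and it uses the Gini coefficient as one common inequality measure. The names `normalize`, `gini`, and `account_features` are invented for this example.

```python
import re
from collections import Counter


def normalize(comment: str) -> str:
    """Hypothetical pattern key (an assumption, not from the paper):
    lowercase the text, mask digit runs, and collapse whitespace."""
    text = re.sub(r"\d+", "<num>", comment.lower())
    return re.sub(r"\s+", " ", text).strip()


def gini(counts) -> float:
    """Gini coefficient of non-negative counts: 0.0 means all patterns
    are equally frequent; values near 1.0 mean one pattern dominates."""
    values = sorted(counts)
    n = len(values)
    total = sum(values)
    if n == 0 or total == 0:
        return 0.0
    # Rank-weighted sum over the ascending-sorted values.
    weighted = sum((rank + 1) * v for rank, v in enumerate(values))
    return (2 * weighted) / (n * total) - (n + 1) / n


def account_features(comments) -> dict:
    """Compute the four per-account features listed in the abstract."""
    non_empty = [c for c in comments if c.strip()]
    patterns = Counter(normalize(c) for c in non_empty)
    return {
        "empty_comments": len(comments) - len(non_empty),
        "non_empty_comments": len(non_empty),
        "comment_patterns": len(patterns),
        "pattern_inequality": gini(list(patterns.values())),
    }
```

For example, calling `account_features` on the comments `"Build #12 passed"`, `"Build #13 passed"`, `"Build #14 passed"`, an empty comment, and `"Thanks for the fix!"` groups the three build messages into one pattern. An account that repeats a few templates many times yields few patterns relative to its comment count, which is the kind of signal a classifier could learn from.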
