Identifying bot activity in GitHub pull request and issue comments
Mehdi Golzadeh, Alexandre Decan, Eleni Constantinou, Tom Mens

TL;DR
This paper presents a natural language processing classification model that effectively distinguishes between human and bot comments on GitHub pull requests and issues, aiding in unbiased socio-technical analyses.
Contribution
The study introduces an NLP-based classification approach trained on a balanced ground-truth dataset of 19,282 comments, and reports an average precision, recall, and F1 score of 0.88 in identifying bot comments on GitHub.
Findings
A multinomial Naive Bayes classifier achieves an F1 score of 0.88
Model effectively distinguishes bot from human comments
Potential for extending to other activity types
Abstract
Development bots are used on GitHub to automate repetitive activities. Such bots communicate with human actors via issue comments and pull request comments. Identifying such bot comments helps prevent bias in socio-technical studies related to software development. To automate their identification, we propose a classification model based on natural language processing. Starting from a balanced ground-truth dataset of 19,282 PR and issue comments, we encode the comments as vectors using a combination of the bag of words and TF-IDF techniques. We train a range of binary classifiers to predict the type of comment (human or bot) based on this vector representation. A multinomial Naive Bayes classifier provides the best results. On a test set containing 50% of the data, it achieves an average precision, recall, and F1 score of 0.88. Although the model shows a promising result…
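The pipeline described in the abstract (bag-of-words counts, TF-IDF weighting, then a multinomial Naive Bayes classifier) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration, not the authors' implementation: the tokenizer, the smoothed IDF formula, the Laplace smoothing constant `alpha`, the class name `TfidfMultinomialNB`, and the six toy comments are all hypothetical. It is written in plain Python so that it runs without dependencies; in practice a library such as scikit-learn would more likely be used.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    # Assumed tokenizer: lowercase runs of letters, digits, and apostrophes.
    return re.findall(r"[a-z0-9']+", text.lower())


class TfidfMultinomialNB:
    """Toy multinomial Naive Bayes over TF-IDF-weighted bag-of-words counts."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # Laplace smoothing constant (assumed value)

    def fit(self, comments, labels):
        docs = [Counter(tokenize(c)) for c in comments]  # bag of words
        n_docs = len(docs)

        # Document frequency: number of comments containing each term.
        df = Counter()
        for doc in docs:
            df.update(doc.keys())

        # Smoothed inverse document frequency (one common variant).
        self.idf = {t: math.log((1 + n_docs) / (1 + n)) + 1.0
                    for t, n in df.items()}

        self.classes = sorted(set(labels))
        class_counts = Counter(labels)
        self.log_prior = {c: math.log(class_counts[c] / n_docs)
                          for c in self.classes}

        # Accumulate TF-IDF mass per term and class.
        weights = {c: defaultdict(float) for c in self.classes}
        for doc, label in zip(docs, labels):
            for term, tf in doc.items():
                weights[label][term] += tf * self.idf[term]

        # Smoothed per-class term log-likelihoods.
        vocab_size = len(self.idf)
        self.log_likelihood = {}
        for c in self.classes:
            total = sum(weights[c].values()) + self.alpha * vocab_size
            self.log_likelihood[c] = {
                t: math.log((weights[c][t] + self.alpha) / total)
                for t in self.idf
            }
        return self

    def predict(self, comment):
        counts = Counter(tokenize(comment))
        scores = {}
        for c in self.classes:
            score = self.log_prior[c]
            for term, tf in counts.items():
                if term in self.idf:  # ignore out-of-vocabulary terms
                    score += tf * self.idf[term] * self.log_likelihood[c][term]
            scores[c] = score
        return max(scores, key=scores.get)


# Hypothetical toy data, not taken from the paper's dataset.
train_comments = [
    "Coverage increased by 0.4% when pulling this branch into master.",
    "Build failed. See the CI log for details.",
    "This pull request has been automatically marked as stale.",
    "Thanks for the patch, could you add a test for the edge case?",
    "I think this breaks on Windows, let me check tomorrow.",
    "Looks good to me, nice work!",
]
train_labels = ["bot", "bot", "bot", "human", "human", "human"]

model = TfidfMultinomialNB().fit(train_comments, train_labels)
print(model.predict("Build failed, coverage decreased on this branch."))
print(model.predict("Thanks, looks good, could you add a test?"))
```

With only six training comments this demonstrates the mechanics and nothing more; the reported 0.88 score comes from the authors' own dataset and setup, which this sketch does not reproduce.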
