Identifying bot activity in GitHub pull request and issue comments

Mehdi Golzadeh; Alexandre Decan; Eleni Constantinou; Tom Mens

arXiv:2103.06042·cs.SE·March 11, 2021

TL;DR

This paper presents a natural language processing classification model that distinguishes human comments from bot comments on GitHub pull requests and issues, which helps prevent bias in socio-technical analyses.

Contribution

The study introduces an NLP-based classification approach, trained on a balanced ground-truth dataset of 19,282 pull request and issue comments, that identifies bot comments on GitHub with an average precision, recall, and F1 score of 0.88.

Findings

01

Multinomial Naive Bayes classifier achieves an F1 score of 0.88

02

Model effectively distinguishes bot from human comments

03

Potential for extending to other activity types

Abstract

Development bots are used on GitHub to automate repetitive activities. Such bots communicate with human actors via issue comments and pull request comments. Identifying such bot comments helps prevent bias in socio-technical studies related to software development. To automate their identification, we propose a classification model based on natural language processing. Starting from a balanced ground-truth dataset of 19,282 PR and issue comments, we encode the comments as vectors using a combination of the bag of words and TF-IDF techniques. We train a range of binary classifiers to predict the type of comment (human or bot) based on this vector representation. A multinomial Naive Bayes classifier provides the best results. On a test set containing 50% of the data, it achieves an average precision, recall, and F1 score of 0.88. Although the model shows a promising result…
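
The pipeline described in the abstract (bag-of-words counts, TF-IDF weighting, multinomial Naive Bayes) can be sketched roughly as follows. This is a minimal standard-library illustration, not the authors' implementation: the tokenizer, the smoothed IDF formula, the Laplace smoothing constant `alpha`, the function names, and the eight toy comments are all assumptions made for the example, and the toy data bears no relation to the paper's 19,282-comment dataset or its reported scores.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    # Assumed tokenizer: lowercase alphanumeric runs (bag-of-words units).
    return re.findall(r"[a-z0-9]+", text.lower())


def fit_idf(docs):
    # Smoothed inverse document frequency for every term seen in training.
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(tokenize(doc)))
    return {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}


def tfidf(text, idf):
    # Term counts weighted by IDF; terms unseen in training are dropped.
    counts = Counter(tokenize(text))
    return {t: c * idf[t] for t, c in counts.items() if t in idf}


def train_nb(docs, labels, alpha=1.0):
    # Multinomial Naive Bayes fitted on TF-IDF weights with Laplace smoothing.
    idf = fit_idf(docs)
    class_counts = Counter(labels)
    weights = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        weights[label].update(tfidf(doc, idf))
    log_prior, log_like = {}, {}
    for label, count in class_counts.items():
        total = sum(weights[label].values()) + alpha * len(idf)
        log_prior[label] = math.log(count / len(docs))
        log_like[label] = {t: math.log((weights[label][t] + alpha) / total) for t in idf}
    return {"idf": idf, "log_prior": log_prior, "log_like": log_like}


def predict(model, text):
    # Pick the class with the highest log posterior (up to a constant).
    vec = tfidf(text, model["idf"])
    scores = {
        label: prior + sum(w * model["log_like"][label][t] for t, w in vec.items())
        for label, prior in model["log_prior"].items()
    }
    return max(scores, key=scores.get)


# Hypothetical toy comments, invented purely for illustration.
train_docs = [
    "Coverage increased by 0.2% when pulling this branch into master",
    "This pull request has been automatically marked as stale because it has not had recent activity",
    "Build failed. See the CI log for details",
    "Thank you for your contribution. All checks have passed and coverage remained the same",
    "I think we should rename this function before merging",
    "Can you add a test for the edge case I mentioned?",
    "Looks good to me, thanks for fixing the typo",
    "I am not sure this works on Windows, did you try it?",
]
train_labels = ["bot"] * 4 + ["human"] * 4

model = train_nb(train_docs, train_labels)
print(predict(model, "Coverage decreased by 1% when pulling this branch into master"))
print(predict(model, "I think you should add a test before merging"))
```

A real replication would more likely use an established library's vectorizer and classifier and evaluate on a held-out split, as the abstract describes; the hand-rolled version above is only meant to make the three stages visible.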
