Categorizing the Content of GitHub README Files

Gede Artha Azriadi Prana; Christoph Treude; Ferdian Thung; Thushari Atapattu; David Lo

arXiv:1802.06997·cs.SE·July 31, 2018

TL;DR

This paper analyzes GitHub README files to understand their content and develops a classifier that automatically categorizes sections, aiding developers in improving documentation quality and information discovery.

Contribution

It provides a systematic analysis of README content and introduces a multi-label classifier with features that effectively categorize README sections automatically.
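Since the unit of classification is a README section, a first step in any such pipeline is splitting a README into sections. The following is a minimal sketch, assuming Markdown READMEs with ATX-style (`#`) headings; the function name `split_sections` and the splitting rules are illustrative assumptions, not the authors' implementation.

```python
import re

# ATX heading: 1-6 '#' characters, whitespace, heading text, optional closing '#'s.
HEADING = re.compile(r"^(#{1,6})\s+(.*?)\s*#*\s*$")


def split_sections(readme_text):
    """Split Markdown text into (level, heading, body) tuples.

    Hypothetical helper for illustration only. Lines inside fenced code
    blocks are never treated as headings, so shell comments such as
    '# install' stay in the section body.
    """
    sections = []
    level, heading, body = 0, "", []
    in_fence = False

    def flush():
        # Keep a section if it has a heading or any non-blank body line.
        if heading or any(line.strip() for line in body):
            sections.append((level, heading, "\n".join(body).strip()))

    for line in readme_text.splitlines():
        if line.lstrip().startswith("```"):
            in_fence = not in_fence
        match = None if in_fence else HEADING.match(line)
        if match:
            flush()
            level, heading, body = len(match.group(1)), match.group(2), []
        else:
            body.append(line)
    flush()
    return sections
```

Each returned tuple could then be passed to a classifier as one instance; text that appears before the first heading is kept as a section with an empty heading.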

Findings

01

Common sections discuss 'What' and 'How' of repositories.

02

Many README files lack purpose and status information.

03

Classifier achieves an F1 score of 0.746.
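For readers unfamiliar with how an F1 score is computed when each section can carry several labels at once, here is a minimal sketch. It uses micro-averaging (pooling true positives, false positives, and false negatives over all sections and labels), which is an assumption made for illustration; this page does not state which averaging scheme the reported 0.746 uses. The label names in the usage note are likewise placeholders, not the paper's category set.

```python
def micro_f1(gold, predicted):
    """Micro-averaged F1 for multi-label predictions.

    gold and predicted are equal-length lists of label sets, one set per
    README section. Illustrative helper, not the paper's evaluation code.
    """
    tp = fp = fn = 0
    for g, p in zip(gold, predicted):
        tp += len(g & p)   # labels predicted and correct
        fp += len(p - g)   # labels predicted but not in the gold set
        fn += len(g - p)   # gold labels the prediction missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

For example, with gold labels `[{"What"}, {"How"}, {"What", "How"}, {"Status"}]` and predictions `[{"What"}, {"How", "What"}, {"How"}, set()]`, there are 3 true positives, 1 false positive, and 2 false negatives, giving precision 0.75, recall 0.6, and F1 of 2/3.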

Abstract

README files play an essential role in shaping a developer's first impression of a software repository and in documenting the software project that the repository hosts. Yet, we lack a systematic understanding of the content of a typical README file as well as tools that can process these files automatically. To close this gap, we conduct a qualitative study involving the manual annotation of 4,226 README file sections from 393 randomly sampled GitHub repositories and we design and evaluate a classifier and a set of features that can categorize these sections automatically. We find that information discussing the 'What' and 'How' of a repository is very common, while many README files lack information regarding the purpose and status of a repository. Our multi-label classifier, which can predict eight different categories, achieves an F1 score of 0.746. To evaluate the usefulness of the…

Code & Models

Repositories

gprana/READMEClassifier (official)
