TL;DR
This paper analyzes GitHub README files to understand their content and develops a classifier that automatically categorizes sections, aiding developers in improving documentation quality and information discovery.
Contribution
The paper provides a systematic analysis of README content and introduces a multi-label classifier, together with a set of features, that categorizes README sections automatically.
Findings
The most common sections discuss the 'What' and 'How' of a repository.
Many README files lack information about the purpose and status of the repository.
The multi-label classifier, which predicts eight categories, achieves an F1 score of 0.746.
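To make the F1 figure concrete, here is a minimal sketch of how an F1 score can be computed when each section may carry several labels at once. The summary does not say which averaging scheme the paper uses, so micro-averaging is an assumption here, and the label names and toy data are hypothetical.

```python
def micro_f1(true_labels, predicted_labels):
    """Micro-averaged F1 over per-section label sets (assumed averaging scheme)."""
    tp = fp = fn = 0
    for truth, pred in zip(true_labels, predicted_labels):
        tp += len(truth & pred)   # labels correctly predicted
        fp += len(pred - truth)   # labels predicted but not annotated
        fn += len(truth - pred)   # labels annotated but not predicted
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical annotations for four README sections.
truth = [{"what"}, {"how"}, {"what", "how"}, {"status"}]
pred = [{"what"}, {"how", "what"}, {"how"}, set()]

score = micro_f1(truth, pred)
print(round(score, 3))
```

In this toy data there are 3 true positives, 1 false positive, and 2 false negatives, so precision is 0.75, recall is 0.6, and the resulting F1 is about 0.667.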
Abstract
README files play an essential role in shaping a developer's first impression of a software repository and in documenting the software project that the repository hosts. Yet, we lack a systematic understanding of the content of a typical README file as well as tools that can process these files automatically. To close this gap, we conduct a qualitative study involving the manual annotation of 4,226 README file sections from 393 randomly sampled GitHub repositories and we design and evaluate a classifier and a set of features that can categorize these sections automatically. We find that information discussing the 'What' and 'How' of a repository is very common, while many README files lack information regarding the purpose and status of a repository. Our multi-label classifier, which can predict eight different categories, achieves an F1 score of 0.746. To evaluate the usefulness of the…
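Because the study works at the level of README sections, a file has to be split into sections before any annotation or classification. The abstract does not describe how the authors do this; the sketch below is one plausible approach that assumes Markdown READMEs with ATX-style (`#`) headings. The function name `split_sections` and the sample text are hypothetical.

```python
import re

# ATX heading: one to six '#' characters, whitespace, then the title.
HEADING = re.compile(r"^(#{1,6})\s+(.+?)\s*$")


def split_sections(readme_text):
    """Return (heading, body) pairs; text before the first heading gets heading None."""
    sections = []
    heading, body = None, []
    in_fence = False
    for line in readme_text.splitlines():
        # Track fenced code blocks so that '# comment' lines inside them
        # are not mistaken for headings.
        if line.lstrip().startswith("```"):
            in_fence = not in_fence
        match = None if in_fence else HEADING.match(line)
        if match:
            if heading is not None or any(l.strip() for l in body):
                sections.append((heading, "\n".join(body).strip()))
            heading, body = match.group(2), []
        else:
            body.append(line)
    if heading is not None or any(l.strip() for l in body):
        sections.append((heading, "\n".join(body).strip()))
    return sections


readme = "# Demo\nA tool.\n\n## Install\n```\n# not a heading\npip install demo\n```\n"
sections = split_sections(readme)
for title, text in sections:
    print(title, "->", len(text), "characters")
```

Each resulting (heading, body) pair is the unit that a multi-label classifier such as the one described above would assign categories to.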
