What's Inside a GitHub Repository? An Empirical Study on the Contents of 10K Projects
Andre Hora, João Eduardo Montandon, Diego Elias Costa

TL;DR
This empirical study analyzes 10,000 GitHub repositories over a decade, revealing evolving content patterns, standard artifacts, and emerging trends like AI-related files, providing insights into open source development dynamics.
Contribution
It offers an initial large-scale empirical analysis of the files, directories, and extensions in 10,000 GitHub repositories, highlighting major content changes and technological trends over ten years.
Findings
Standardization of README.md, .gitignore, and LICENSE files.
Rise of GitHub Actions as the main CI/CD tool.
Growth of configuration formats such as TOML, YAML, and JSON, alongside a decline in XML.
Abstract
GitHub is the largest code hosting platform, with millions of repositories spanning multiple technologies. Despite this, little is known about the actual contents of GitHub's repositories in the wild. This paper presents an initial empirical analysis to better understand the contents of real-world GitHub repositories. We analyze the files, directories, and extensions present in 10,000 GitHub repositories, as well as their evolution over ten years. Our results show major changes in GitHub over the last decade: (1) the consolidation of README.md, .gitignore, and LICENSE as standard artifacts; (2) the rise of GitHub Actions as the dominant CI/CD platform; (3) the growth of configuration formats such as TOML, YAML, and JSON, alongside a decline in XML; (4) new trends, such as the growth of Dockerfile; and (5) emerging content related to LLMs and generative AI (e.g., AGENTS.md). Based on our…
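As a rough illustration of the kind of analysis the abstract describes, the sketch below counts how many repositories contain a given file name, extension, or top-level directory. It is not the authors' tooling: the function names `summarize_contents` and `share`, the repository names, and the file lists are all hypothetical, and each repository is assumed to be available as a plain list of relative paths.

```python
from collections import Counter
from pathlib import PurePosixPath


def summarize_contents(repos):
    """Count in how many repositories each file name, extension,
    and top-level directory appears (once per repository)."""
    names, extensions, directories = Counter(), Counter(), Counter()
    for paths in repos.values():
        seen_names, seen_exts, seen_dirs = set(), set(), set()
        for raw in paths:
            path = PurePosixPath(raw)
            seen_names.add(path.name)
            if path.suffix:
                seen_exts.add(path.suffix.lower())
            if len(path.parts) > 1:
                # The first path component is a top-level directory.
                seen_dirs.add(path.parts[0])
        names.update(seen_names)
        extensions.update(seen_exts)
        directories.update(seen_dirs)
    return names, extensions, directories


def share(counter, key, total):
    """Fraction of repositories that contain `key`."""
    return counter[key] / total if total else 0.0


# Hypothetical sample data, for illustration only.
sample_repos = {
    "example/a": ["README.md", ".gitignore", "LICENSE",
                  ".github/workflows/ci.yml", "src/main.py"],
    "example/b": ["README.md", "Dockerfile", "pyproject.toml", "AGENTS.md"],
}

names, extensions, directories = summarize_contents(sample_repos)
print(share(names, "README.md", len(sample_repos)))
print(extensions.most_common(3))
```

Counting each item once per repository, rather than once per file, keeps a single large project from dominating the totals. Tracking evolution over ten years would require repeating the same count on per-year snapshots, which this sketch does not attempt.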
