DRAGON: Robust Classification for Very Large Collections of Software Repositories
Stefano Balla (DISI), Stefano Zacchiroli (IP Paris, LTCI, ACES, INFRES), Thomas Degueule (LaBRI, UB), Jean-Rémy Falleri (LaBRI, UB), Romain Robbes (LaBRI, UB)

TL;DR
DRAGON is a scalable repository classifier for very large software collections. It relies on lightweight signals (file and directory names, plus the README when one is available), outperforms the prior state of the art, and remains effective when documentation is missing.
Contribution
The paper introduces DRAGON, a repository classification approach for large, diverse collections that does not depend on README files being present.
Findings
DRAGON improves F1@5 from 54.8% to 60.8%.
Performance degrades by only 6% when README files are absent.
The authors release a dataset of 825,000 repositories for future research.
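For readers unfamiliar with the metric, here is a minimal sketch of one common way to compute F1@k for a single repository: take the top k predicted topics, compare them with the ground-truth topics, and take the harmonic mean of precision@k and recall@k. The function name `f1_at_k` and the example topics are hypothetical, and the paper's exact definition (for instance, how scores are averaged across repositories) may differ.

```python
from typing import Sequence, Set


def f1_at_k(ranked_topics: Sequence[str], true_topics: Set[str], k: int = 5) -> float:
    """F1@k for one repository (illustrative definition, not taken from the paper)."""
    top_k = list(ranked_topics)[:k]
    # Count how many of the top-k predictions are correct topics.
    hits = sum(1 for topic in top_k if topic in true_topics)
    if hits == 0:
        return 0.0
    precision = hits / k                 # fraction of the k slots that are correct
    recall = hits / len(true_topics)     # fraction of true topics that were recovered
    return 2 * precision * recall / (precision + recall)


# Hypothetical ranked predictions and ground truth for one repository.
predicted = ["python", "web", "cli", "database", "testing", "docker"]
actual = {"python", "cli", "docker"}
score = f1_at_k(predicted, actual, k=5)
```

In this example two of the top five predictions are correct, so precision@5 is 2/5 and recall@5 is 2/3. On the reported numbers, moving from 54.8% to 60.8% is a gain of 6.0 percentage points.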
Abstract
The ability to automatically classify source code repositories with "topics" that reflect their content and purpose is very useful, especially when navigating or searching through large software collections. However, existing approaches often rely heavily on README files and other metadata, which are frequently missing, limiting their applicability in real-world large-scale settings. We present DRAGON, a repository classifier designed for very large and diverse software collections. It operates entirely on lightweight signals commonly stored in version control systems: file and directory names, and optionally the README when available. In repository classification at scale, DRAGON improves F1@5 from 54.8% to 60.8%, surpassing the state of the art. DRAGON remains effective even when README files are absent, with performance degrading by only 6% compared to when they are present. This…
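To make the notion of lightweight signals concrete, the sketch below gathers the inputs the abstract names from a checked-out repository: relative file and directory names, plus the text of a top-level README if one exists. This is an illustration only, not DRAGON's actual preprocessing; the function `collect_signals`, the returned dictionary layout, and the README detection rule are all assumptions.

```python
import os
from typing import Dict, List, Optional


def collect_signals(repo_root: str) -> Dict[str, object]:
    """Collect path names and an optional README (hypothetical helper)."""
    paths: List[str] = []
    readme: Optional[str] = None
    for dirpath, dirnames, filenames in os.walk(repo_root):
        # Skip version-control internals and keep traversal order stable.
        dirnames[:] = sorted(d for d in dirnames if d != ".git")
        rel_dir = os.path.relpath(dirpath, repo_root)
        for name in dirnames + sorted(filenames):
            rel = os.path.normpath(os.path.join(rel_dir, name))
            paths.append(rel.replace(os.sep, "/"))
        # Assumption: only a README at the repository root counts.
        if rel_dir == "." and readme is None:
            for name in sorted(filenames):
                if name.lower().startswith("readme"):
                    full = os.path.join(dirpath, name)
                    with open(full, encoding="utf-8", errors="replace") as fh:
                        readme = fh.read()
                    break
    return {"paths": paths, "readme": readme}
```

When no README is found, the `readme` entry is `None`, which mirrors the README-absent setting the abstract describes.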
Taxonomy
Topics: Software Engineering Research · Scientific Computing and Data Management · Web Data Mining and Analysis
