Recipe for Discovery: A Pipeline for Institutional Open Source Activity
Juanita Gomez, Emily Lovell, Stephanie Lieggi, Alvaro A. Cardenas, James Davis

TL;DR
This paper introduces a pipeline that systematically discovers and analyzes open source projects across ten universities, giving insight into institutional practices and highlighting where open source engagement can improve.
Contribution
It presents a novel end-to-end framework for identifying and analyzing academic open source projects at scale, including data collection, metadata extraction, and classification.
Findings
Identified over 200,000 repositories across ten universities.
Revealed common issues like missing licenses and limited community engagement.
Provided actionable insights for policy and support improvements.
Abstract
Open source software development, particularly within institutions such as universities and research laboratories, is often decentralized and difficult to track. Although academic teams produce many impactful scientific tools, their projects do not always follow consistent open source practices, such as clear licensing, documentation, or community engagement. As a result, these efforts often go unrecognized due to limited visibility and institutional awareness, and the software itself can be difficult to sustain over time. This paper presents an end-to-end framework for systematically discovering and analyzing open source projects across distributed academic systems. Using ten universities as a case study, we build a pipeline that collects data via GitHub's REST API, extracts metadata, and predicts both institutional affiliation and project type (e.g., development tools, educational…
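The collection and metadata-extraction steps described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the abstract says only that data is collected via GitHub's REST API, so the choice of the repository search endpoint, the helper names (`build_search_url`, `extract_metadata`, `has_license`), and the selection of fields are all assumptions. The field names themselves (`full_name`, `license.spdx_id`, `stargazers_count`, `forks_count`) are those found in GitHub repository objects.

```python
from urllib.parse import urlencode

GITHUB_API = "https://api.github.com"


def build_search_url(keyword, page=1, per_page=100):
    """Build a URL for GitHub's repository search endpoint.

    Assumption: the pipeline queries by keyword (e.g. a university name);
    the paper summary does not say which endpoints are used.
    """
    params = {"q": keyword, "per_page": per_page, "page": page}
    return f"{GITHUB_API}/search/repositories?{urlencode(params)}"


def extract_metadata(repo):
    """Pull a few fields from one repository object returned by the API."""
    # "license" is null in the API response when no license is detected.
    license_info = repo.get("license") or {}
    return {
        "full_name": repo.get("full_name"),
        "description": repo.get("description") or "",
        "license": license_info.get("spdx_id"),
        "stars": repo.get("stargazers_count", 0),
        "forks": repo.get("forks_count", 0),
    }


def has_license(meta):
    """Return True if a recognizable license was detected (hypothetical check)."""
    return meta["license"] not in (None, "NOASSERTION")


# Hand-written sample objects standing in for API responses (not real data).
sample_repos = [
    {"full_name": "example-lab/tool", "description": "A research tool",
     "license": {"spdx_id": "MIT"}, "stargazers_count": 12, "forks_count": 3},
    {"full_name": "example-lab/course-notes", "description": None,
     "license": None, "stargazers_count": 0, "forks_count": 0},
]

records = [extract_metadata(r) for r in sample_repos]
unlicensed = [m["full_name"] for m in records if not has_license(m)]
```

Records shaped like this could then feed the affiliation and project-type prediction steps. A list such as `unlicensed` is one simple way to surface the missing-license issue noted in the findings.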
