The Software Heritage License Dataset (2022 Edition)
Jesús M. González-Barahona (URJC), Sergio Montes-Leon (URJC), Gregorio Robles (URJC), Stefano Zacchiroli (IP Paris, LTCI)

TL;DR
This paper presents a comprehensive dataset of 6.9 million unique software license files extracted from the Software Heritage archive, enabling large-scale license analysis and research.
Contribution
It introduces a large, well-characterized dataset of software licenses with metadata, covering nearly all known public license texts, for research and practical applications.
Findings
Dataset contains 6.9 million unique license files.
Includes metadata like MIME type, SPDX license, and first appearance.
Manual analysis of 8,102 documents provides ground truth.
Abstract
Context: When software is released publicly, it is common to include with it either the full text of the license or licenses under which it is published, or a detailed reference to them. Therefore public licenses, including FOSS (free, open source software) licenses, are usually publicly available in source code repositories. Objective: To compile a dataset containing as many documents as possible that contain the text of software licenses, or references to the license terms. Once compiled, characterize the dataset so that it can be used for further research, or practical purposes related to license analysis. Method: Retrieve from Software Heritage (the largest publicly available archive of FOSS source code) all versions of all files whose names are commonly used to convey licensing terms. All retrieved documents will be characterized in various ways, using automated and manual…
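The Method above selects files by name. As a rough illustration only, the sketch below filters a list of paths by whether the file name looks like one commonly used for licensing terms. The abstract does not give the actual name list, so the pattern `LICENSE_NAME_RE`, the helper `looks_like_license_file`, and the sample paths are all hypothetical and are not the authors' selection criteria.

```python
import re

# Hypothetical pattern: the abstract does not list the file names actually used.
# Matches a base name such as LICENSE, COPYING, or NOTICE, optionally followed
# by a separator (".", "_", "-") and any suffix, case-insensitively.
LICENSE_NAME_RE = re.compile(
    r"^(licen[sc]e|copying|copyright|notice|unlicense)([._-].*)?$",
    re.IGNORECASE,
)


def looks_like_license_file(path: str) -> bool:
    """Return True if the final path component resembles a license file name."""
    name = path.rsplit("/", 1)[-1]
    return bool(LICENSE_NAME_RE.match(name))


# Illustrative paths, not taken from the dataset.
paths = [
    "LICENSE",
    "src/COPYING.txt",
    "docs/license-mit.md",
    "README.md",
    "src/licenser.py",
]
matches = [p for p in paths if looks_like_license_file(p)]
```

Matching on the base name alone keeps the filter cheap, but it can only nominate candidates: whether a file really contains license text still has to be checked on its content, which is what the characterization step in the abstract is for.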
