A Large-scale Dataset of (Open Source) License Text Variants

Stefano Zacchiroli (IP Paris; LTCI)

arXiv:2204.00256·cs.SE·April 4, 2022

TL;DR

This paper presents a comprehensive large-scale dataset of open source license texts, enabling diverse research in legal text analysis, license classification, and software licensing history.

Contribution

It introduces a dataset of 6.5 million unique license files from the Software Heritage archive, with rich metadata for research and analysis.

Findings

01 Dataset enables empirical studies on open source licensing

02 Supports training of automated license classifiers

03 Facilitates NLP and historical analyses of legal texts

Abstract

We introduce a large-scale dataset of the complete texts of free/open source software (FOSS) license variants. To assemble it we have collected from the Software Heritage archive (the largest publicly available archive of FOSS source code with accompanying development history) all versions of files whose names are commonly used to convey licensing terms to software users and developers. The dataset consists of 6.5 million unique license files that can be used to conduct empirical studies on open source licensing, training of automated license classifiers, natural language processing (NLP) analyses of legal texts, as well as historical and phylogenetic studies on FOSS licensing. Additional metadata about shipped license files are also provided, making the dataset ready to use in various contexts; they include: file length measures, detected MIME type, detected SPDX license (using ScanCode),…
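
The selection step described in the abstract (keep files whose names commonly convey licensing terms, then keep one copy of each distinct text) can be illustrated with a minimal sketch. The filename pattern, the function names, and the metadata keys below are assumptions made for illustration; they are not the rule or the schema actually used to build the dataset.

```python
import hashlib
import re

# Hypothetical pattern: an illustrative guess at "license-like" file names,
# NOT the selection rule used to build the dataset.
LICENSE_NAME = re.compile(
    r"^([a-z0-9._-]+\.)?(copying|licen[cs]e|notice|copyright)(\.[a-z0-9._-]+)?$",
    re.IGNORECASE,
)


def is_license_name(filename: str) -> bool:
    """Return True if the bare file name looks like a license file."""
    return LICENSE_NAME.match(filename) is not None


def unique_license_blobs(files):
    """Deduplicate license-like files by content hash.

    `files` is an iterable of (filename, content_bytes) pairs.
    Returns a dict mapping SHA1 hex digest to simple length metadata
    and the set of file names under which that content was seen.
    """
    seen = {}
    for name, content in files:
        if not is_license_name(name):
            continue
        digest = hashlib.sha1(content).hexdigest()
        if digest not in seen:
            seen[digest] = {
                "names": {name},
                "size_bytes": len(content),
                "size_lines": content.count(b"\n"),
            }
        else:
            seen[digest]["names"].add(name)
    return seen
```

Given a handful of (name, bytes) pairs, `unique_license_blobs` returns one entry per distinct text, so two identical files named `LICENSE` and `license.md` collapse into a single record listing both names.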
