Efficient Prior Publication Identification for Open Source Code
Daniele Serafini (UNITO), Stefano Zacchiroli (LTCI, IP Paris)

TL;DR
This paper presents an efficient method for identifying previously published source code within large codebases by leveraging a knowledge base and Merkle DAGs, enhancing open source compliance processes.
Contribution
It introduces a novel approach and tool, swh-scanner, that efficiently detects prior publication of code parts using a Merkle DAG-based knowledge base, validated on extensive real-world data.
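The summary above names the technique but not its mechanics. As a rough illustration only, here is a minimal Python sketch of how a Merkle DAG can reduce the number of knowledge-base queries. It assumes that a directory's identifier is derived from its children's identifiers, and that a known directory implies all of its contents are known. The function names, the SHA-256 hashing scheme, and the in-memory `known` set are hypothetical stand-ins. They are not the swh-scanner API, and they are not the identifier scheme used by the paper's knowledge base.

```python
import hashlib
from typing import Dict, List, Set, Tuple, Union

# Hypothetical in-memory code base: file name -> bytes, directory name -> nested dict.
Tree = Dict[str, Union[bytes, "Tree"]]


def file_id(data: bytes) -> str:
    """Identifier of a file, derived only from its content."""
    return hashlib.sha256(b"file\0" + data).hexdigest()


def dir_id(entries: List[Tuple[str, str]]) -> str:
    """Identifier of a directory, derived from its sorted (name, child id) pairs."""
    h = hashlib.sha256(b"dir\0")
    for name, child in sorted(entries):
        h.update(name.encode() + b"\0" + child.encode() + b"\n")
    return h.hexdigest()


def index(tree: Tree, path: str, ids: Dict[str, str]) -> str:
    """Compute identifiers bottom-up and record them in ids, keyed by path."""
    entries = []
    for name, node in tree.items():
        child_path = f"{path}/{name}" if path else name
        if isinstance(node, dict):
            cid = index(node, child_path, ids)
        else:
            cid = file_id(node)
            ids[child_path] = cid
        entries.append((name, cid))
    ids[path] = dir_id(entries)
    return ids[path]


def scan(tree: Tree, known: Set[str]) -> Tuple[Dict[str, bool], int]:
    """Query top-down; a known directory is not descended into."""
    ids: Dict[str, str] = {}
    index(tree, "", ids)
    results: Dict[str, bool] = {}
    queries = 0

    def visit(node: Union[bytes, Tree], path: str) -> None:
        nonlocal queries
        queries += 1  # one lookup against the (assumed) knowledge base
        hit = ids[path] in known
        results[path] = hit
        if hit or not isinstance(node, dict):
            return  # known subtree or leaf file: nothing more to ask
        for name, child in node.items():
            visit(child, f"{path}/{name}" if path else name)

    visit(tree, "")
    return results, queries


# Usage: the "vendor" directory is assumed to be previously published.
tree: Tree = {
    "vendor": {"a.c": b"int a;", "b.c": b"int b;"},
    "main.c": b"int main(){}",
}
all_ids: Dict[str, str] = {}
index(tree, "", all_ids)
known = {all_ids["vendor"]}
results, queries = scan(tree, known)
```

In this toy run the two files under `vendor` are never queried individually, because the directory identifier already matches. The root is keyed by the empty path.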
Findings
High efficiency in query processing and wall-clock time
Successful validation on 16,845 real-world code bases
Effective detection of previously published code segments
Abstract
Free/Open Source Software (FOSS) enables large-scale reuse of preexisting software components. The main drawback is increased complexity in software supply chain management. A common approach to tame such complexity is automated open source compliance, which consists in automating the verification of adherence to various open source management best practices about license obligation fulfillment, vulnerability tracking, software composition analysis, and nearby concerns. We consider the problem of auditing a source code base to determine which of its parts have been published before, which is an important building block of automated open source compliance toolchains. Indeed, if source code allegedly developed in house is recognized as having been previously published elsewhere, alerts should be raised to investigate where it comes from and whether this entails that additional obligations…
Taxonomy
Topics: Software Engineering Research · Advanced Malware Detection Techniques · Web Application Security Vulnerabilities
