Repo2Vec: A Comprehensive Embedding Approach for Determining Repository Similarity
Md Omar Faruk Rokon, Pei Yan, Risul Islam, Michalis Faloutsos

TL;DR
Repo2Vec is a novel embedding method that combines metadata, structure, and source code to accurately represent repositories for similarity detection and classification tasks.
Contribution
It introduces a comprehensive embedding approach that integrates multiple repository features, outperforming previous methods in accuracy and enabling advanced repository analysis.
Findings
Achieves 93% precision in repository similarity detection.
Distinguishes malware from benign repositories with 98% precision.
Supports meaningful hierarchical clustering of repositories.
Abstract
How can we identify similar repositories and clusters among a large online archive, such as GitHub? Determining repository similarity is an essential building block in studying the dynamics and the evolution of such software ecosystems. The key challenge is to determine the right representation for the diverse repository features in a way that: (a) it captures all aspects of the available information, and (b) it is readily usable by ML algorithms. We propose Repo2Vec, a comprehensive embedding approach to represent a repository as a distributed vector by combining features from three types of information sources. As our key novelty, we consider three types of information: (a) metadata, (b) the structure of the repository, and (c) the source code. We also introduce a series of embedding approaches to represent and combine these information types into a single embedding. We evaluate our…
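To make the idea of combining three information types into a single embedding concrete, here is a minimal sketch. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not say how the per-source vectors are produced or merged, so this example assumes each source (metadata, structure, source code) has already been embedded, merges them by normalizing and concatenating, and compares repositories by cosine similarity. The function names and the toy vectors are hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def combine_embeddings(meta_vec, struct_vec, source_vec):
    """Assumed combination rule (not taken from the paper): normalize each
    per-source vector so no source dominates, then concatenate them."""
    combined = []
    for part in (meta_vec, struct_vec, source_vec):
        combined.extend(l2_normalize(part))
    return combined


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy per-source embeddings (made-up numbers, for illustration only).
repo_a = combine_embeddings([0.9, 0.1], [0.2, 0.8], [0.5, 0.5])
repo_b = combine_embeddings([0.8, 0.2], [0.3, 0.7], [0.4, 0.6])
repo_c = combine_embeddings([0.1, 0.9], [0.9, 0.1], [-0.5, 0.5])

sim_ab = cosine_similarity(repo_a, repo_b)
sim_ac = cosine_similarity(repo_a, repo_c)
print(f"sim(a, b) = {sim_ab:.3f}, sim(a, c) = {sim_ac:.3f}")
```

In this toy setup, repo_a and repo_b agree on all three sources and score higher than repo_a and repo_c. A single fixed-length vector per repository is also the form that standard classifiers and clustering algorithms accept, which is the "readily usable by ML algorithms" requirement the abstract names.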
Taxonomy
Topics: Software Engineering Research · Advanced Malware Detection Techniques · Misinformation and Its Impacts
