Tooling for Time- and Space-efficient git Repository Mining

Fabian Heseding, Willy Scheibel, Jürgen Döllner

arXiv:2205.01351 · cs.SE · May 4, 2022


TL;DR

This paper introduces pyrepositoryminer, a Python command-line tool that speeds up git repository traversal and data extraction for large-scale static source code analysis, with a reported single-thread speedup of 15.6x over existing tools.

Contribution

The paper presents a command-line tool that combines bare repository access, in-memory storage, parallelization, caching, and change-based analysis for efficient repository mining, and that accepts both metrics written in Python and external programs for data extraction.

Findings

01 Single-thread speedup of 15.6x over existing tools

02 Multi-threaded execution achieves up to 86.9x speedup with 12 threads

03 Effective optimization techniques enable scalable analysis of large repositories

Abstract

Software projects under version control grow with each commit, accumulating up to hundreds of thousands of commits per repository. Especially for such large projects, the traversal of a repository and data extraction for static source code analysis poses a trade-off between granularity and speed. We showcase the command-line tool pyrepositoryminer that combines a set of optimization approaches for efficient traversal and data extraction from git repositories while being adaptable to third-party and custom software metrics and data extractions. The tool is written in Python and combines bare repository access, in-memory storage, parallelization, caching, change-based analysis, and optimized communication between the traversal and custom data extraction components. The tool allows for both metrics written in Python and external programs for data extraction. A single-thread performance…
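To make the caching and change-based analysis ideas concrete, here is a minimal sketch. It is not pyrepositoryminer's implementation or API: the data model (commits as plain path-to-blob-id mappings), the `mine` function, and the `count_lines` metric are all hypothetical, and no real git access takes place. The sketch only shows why keying metric results by blob id avoids re-analyzing files that did not change between commits.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical data model (an assumption, not the tool's): a commit is a
# mapping of file path -> blob id, and `blobs` maps blob id -> file content.
Commit = Dict[str, str]


def count_lines(content: str) -> int:
    """Toy metric: number of lines in a file."""
    return len(content.splitlines())


def mine(
    commits: List[Commit],
    blobs: Dict[str, str],
    metric: Callable[[str], int],
) -> Tuple[List[int], int]:
    """Sum `metric` over all files of each commit, caching results by blob id."""
    cache: Dict[str, int] = {}  # blob id -> metric value
    computed = 0                # how often the metric actually ran
    results: List[int] = []
    for tree in commits:
        total = 0
        for blob_id in tree.values():
            if blob_id not in cache:
                # Only content not seen before is analyzed.
                cache[blob_id] = metric(blobs[blob_id])
                computed += 1
            total += cache[blob_id]
        results.append(total)
    return results, computed


blobs = {"a1": "x\ny\n", "b1": "z\n", "a2": "x\ny\nw\n"}
commits = [
    {"a.py": "a1", "b.py": "b1"},                # initial commit
    {"a.py": "a2", "b.py": "b1"},                # only a.py changed
    {"a.py": "a2", "b.py": "b1", "c.py": "b1"},  # c.py added, same content as b.py
]
results, computed = mine(commits, blobs, count_lines)
print(results, computed)
```

In this toy history the metric runs three times, once per distinct blob, whereas recomputing every file at every commit would run it seven times. The gap widens with longer histories in which most files stay unchanged from one commit to the next.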


Code & Models

Repositories

fabianhe/pyrepositoryminer (official)
