GitHub Proxy Server: A tool for supporting massive data collection on GitHub

Hudson Silva Borges; Marco Tulio Valente

arXiv:2505.18305·cs.SE·May 27, 2025

TL;DR

The paper introduces GitHub Proxy Server, a cross-platform tool that simplifies large-scale data collection from GitHub by overcoming API restrictions, thereby enhancing mining performance.

Contribution

It presents a novel, system-independent proxy server that streamlines massive data collection from GitHub, addressing API limitations and operational complexities.

Findings

01 Improved data collection performance on GitHub

02 Simplified handling of API restrictions

03 Platform-independent data mining support

Abstract

GitHub is the most popular social coding platform and is widely used by developers and organizations around the world to host their open-source projects. In addition, the platform has a web API that allows developers to collect information from the public repositories hosted on it. However, collecting massive amounts of data from GitHub can be very challenging due to existing restrictions and abuse detection mechanisms. In this work, we present a tool, called GitHub Proxy Server, which abstracts away such complexities and is independent of operating system and programming language. We show that, using the proposed tool, it is possible to improve the performance of GitHub mining tasks without any additional complexity.
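The abstract does not describe how the tool works internally. As an illustration of one common way a proxy could hide API rate limits from its clients, the sketch below rotates over several access tokens and tracks each token's remaining budget from the `x-ratelimit-remaining` and `x-ratelimit-reset` headers that GitHub's REST API returns. The names `TokenState` and `TokenPool`, and the starting budget of 5000, are assumptions made for this example; they are not taken from the paper or from the tool.

```python
import time
from dataclasses import dataclass
from typing import Dict, List, Mapping, Optional


@dataclass
class TokenState:
    """Rate-limit bookkeeping for one access token (hypothetical helper)."""
    token: str
    remaining: int = 5000   # assumed starting budget; real values come from headers
    reset_at: float = 0.0   # epoch seconds at which the budget refills


class TokenPool:
    """Picks the token with the most requests left (illustrative sketch only)."""

    def __init__(self, tokens: List[str]) -> None:
        if not tokens:
            raise ValueError("at least one token is required")
        self._states: Dict[str, TokenState] = {t: TokenState(t) for t in tokens}

    def acquire(self, now: Optional[float] = None) -> Optional[TokenState]:
        """Return a usable token, or None if every token is exhausted."""
        now = time.time() if now is None else now
        # A token is usable if it has budget left or its reset time has passed.
        usable = [
            s for s in self._states.values()
            if s.remaining > 0 or now >= s.reset_at
        ]
        if not usable:
            return None
        return max(usable, key=lambda s: s.remaining)

    def update(self, token: str, headers: Mapping[str, str]) -> None:
        """Record the budget reported in a response's rate-limit headers.

        Header names are expected in lowercase here; a real HTTP client
        usually exposes them through a case-insensitive mapping.
        """
        state = self._states[token]
        state.remaining = int(headers["x-ratelimit-remaining"])
        state.reset_at = float(headers["x-ratelimit-reset"])

    def seconds_until_available(self, now: Optional[float] = None) -> float:
        """How long a caller would have to wait before any token is usable."""
        now = time.time() if now is None else now
        if self.acquire(now) is not None:
            return 0.0
        return min(s.reset_at for s in self._states.values()) - now
```

A proxy built along these lines would call `acquire()` before forwarding each client request, attach the chosen token, and call `update()` with the response headers. When `acquire()` returns `None`, it could hold the request for `seconds_until_available()` instead of surfacing the limit to the client.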
