SEART Data Hub: Streamlining Large-Scale Source Code Mining and Pre-Processing
Ozren Dabić, Rosalia Tufano, Gabriele Bavota

TL;DR
SEART Data Hub is a web tool that simplifies building and preprocessing large-scale source code datasets from GitHub, saving researchers time and computational resources for software engineering and deep learning studies.
Contribution
It introduces a user-friendly web application that automates dataset creation and preprocessing for large-scale source code, tailored to specific research needs.
Findings
Enables quick dataset generation within hours.
Supports customizable mining and preprocessing criteria.
Reduces time and computational costs for researchers.
Abstract
Large-scale code datasets have acquired an increasingly central role in software engineering (SE) research. This is the result of (i) the success of the mining software repositories (MSR) community, which pushed the standards of empirical studies in SE; and (ii) the recent advent of deep learning (DL) in software engineering, with models trained and tested on large source code datasets. While some ready-to-use datasets exist in the literature, researchers often need to build and pre-process their own dataset to meet the specific requirements of the study or technique they are working on. This implies a substantial cost in terms of time and computational resources. In this work we present the SEART Data Hub, a web application that allows researchers to easily build and pre-process large-scale datasets featuring code mined from public GitHub repositories. Through a simple web interface, researchers can…
Taxonomy
Topics: Web Data Mining and Analysis · Data Mining Algorithms and Applications · Scientific Computing and Data Management
