Code4ML: a Large-scale Dataset of annotated Machine Learning Code
Anastasia Drozdova, Polina Guseva, Ekaterina Trofimova, Anna Scherbakova, Andrey Ustyuzhanin

TL;DR
Code4ML is a large, annotated dataset of approximately 2.5 million machine learning code snippets from Kaggle, designed to facilitate tasks like code classification, auto-completion, and generation in data science applications.
Contribution
The paper introduces Code4ML, a comprehensive annotated dataset of ML code snippets from Kaggle, enabling advanced machine learning tasks on code.
Findings
Contains ~2.5 million ML code snippets collected from ~100 thousand Kaggle Jupyter notebooks
Includes a human-annotated subset for supervised tasks
Supports applications such as code classification and automatic code generation
Abstract
Program code as a data source is gaining popularity in the data science community. Possible applications for models trained on such assets range from classification for data dimensionality reduction to automatic code generation. However, without annotation, the number of methods that can be applied is somewhat limited. To address the lack of annotated datasets, we present the Code4ML corpus. It contains code snippets, task summaries, competitions and dataset descriptions publicly available from Kaggle, the leading platform for hosting data science competitions. The corpus consists of ~2.5 million snippets of ML code collected from ~100 thousand Jupyter notebooks. A representative fraction of the snippets is annotated by human assessors through a user-friendly interface specially designed for that purpose. The Code4ML dataset can potentially help address a number of software engineering or…
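The abstract states that the snippets were collected from Jupyter notebooks but does not describe the extraction procedure. As a rough illustration only, the sketch below shows one way code cells could be pulled out of a notebook file as individual snippets, relying on the standard notebook JSON layout (a top-level `cells` list whose entries carry `cell_type` and `source`). The function name `extract_code_snippets` and the sample notebook are hypothetical and are not taken from the Code4ML pipeline.

```python
import json
from typing import List


def extract_code_snippets(notebook_json: str) -> List[str]:
    """Return the source text of every non-empty code cell in a notebook.

    Hypothetical helper for illustration; not the Code4ML authors' method.
    """
    notebook = json.loads(notebook_json)
    snippets = []
    for cell in notebook.get("cells", []):
        # Skip markdown and raw cells; only code cells become snippets.
        if cell.get("cell_type") != "code":
            continue
        source = cell.get("source", "")
        # The notebook format stores "source" as a string or a list of lines.
        if isinstance(source, list):
            source = "".join(source)
        if source.strip():
            snippets.append(source)
    return snippets


# Hypothetical notebook: one markdown cell, one code cell, one empty code cell.
example_notebook = json.dumps({
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Load data"]},
        {"cell_type": "code", "metadata": {}, "outputs": [],
         "execution_count": None,
         "source": ["import pandas as pd\n", "df = pd.read_csv('train.csv')"]},
        {"cell_type": "code", "metadata": {}, "outputs": [],
         "execution_count": None, "source": []},
    ],
})

snippets = extract_code_snippets(example_notebook)
for snippet in snippets:
    print(snippet)
```

Treating each code cell as one snippet is an assumption made here for simplicity; the corpus may segment notebooks differently.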
Taxonomy
Topics: Software Engineering Research · Machine Learning and Data Classification · Advanced Malware Detection Techniques
