CodeNet: A Large-Scale AI for Code Dataset for Learning a Diversity of Coding Tasks
Ruchir Puri, David S. Kung, Geert Janssen, Wei Zhang, Giacomo Domeniconi, Vladimir Zolotov, Julian Dolby, Jie Chen, Mihir Choudhury, Lindsey Decker, Veronika Thost, Luca Buratti, Saurabh Pujar, Shyam Ramji, Ulrich Finkler, Susan Malaika, Frederick Reiss

TL;DR
CodeNet is a comprehensive large-scale dataset with over 14 million code samples across 55 languages, designed to advance AI research in coding tasks such as classification, translation, and performance optimization.
Contribution
The paper introduces a large annotated dataset for AI-driven code understanding and transformation, intended to support research across software engineering and machine learning.
Findings
Successful code classification and similarity experiments
Rich annotations enable benchmarking of AI models
Sample test sets support code correctness verification
Abstract
Over the last several decades, software has been woven into the fabric of every aspect of our society. As software development surges and code infrastructure of enterprise applications ages, it is now more critical than ever to increase software development productivity and modernize legacy applications. Advances in deep learning and machine learning algorithms have enabled numerous breakthroughs, motivating researchers to leverage AI techniques to improve software development efficiency. Thus, the fast-emerging research area of AI for Code has garnered new interest and gathered momentum. In this paper, we present a large-scale dataset CodeNet, consisting of over 14 million code samples and about 500 million lines of code in 55 different programming languages, which is aimed at teaching AI to code. In addition to its large scale, CodeNet has a rich set of high-quality annotations to…
