JEMMA: An Extensible Java Dataset for ML4Code Applications
Anjan Karmakar, Miltiadis Allamanis, Romain Robbes

TL;DR
JEMMA is a comprehensive, extensible Java dataset designed to facilitate machine learning research on source code, providing diverse representations and properties for 50,000 projects to enable experimentation and advance ML4Code models.
Contribution
We introduce JEMMA, a large-scale, high-quality, extensible Java dataset with rich source code representations and properties, aimed at lowering barriers for ML4Code research and experimentation.
Findings
Significant work remains in designing context-aware source code models.
Empirical studies show the dataset's utility for ML4Code research.
Abstract
Machine Learning for Source Code (ML4Code) is an active research field in which extensive experimentation is needed to discover how best to use source code's richly structured information. With this in mind, we introduce JEMMA, an Extensible Java Dataset for ML4Code Applications, which is a large-scale, diverse, and high-quality dataset targeted at ML4Code. Our goal with JEMMA is to lower the barrier to entry in ML4Code by providing the building blocks to experiment with source code models and tasks. JEMMA comes with a considerable amount of pre-processed information such as metadata, representations (e.g., code tokens, ASTs, graphs), and several properties (e.g., metrics, static analysis results) for 50,000 Java projects from the 50KC dataset, with over 1.2 million classes and over 8 million methods. JEMMA is also extensible, allowing users to add new properties and representations to…
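To make the idea of an extensible dataset concrete, the following is a minimal Python sketch of how a method-level entry with representations and properties might be modeled, and how a user-defined property could be added. It is an illustration only: `MethodEntry`, `add_property`, the field names, and the toy data are all hypothetical and do not reflect JEMMA's actual schema or tooling.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class MethodEntry:
    """Hypothetical record for one method; not JEMMA's actual schema."""
    method_id: str
    representations: Dict[str, object] = field(default_factory=dict)
    properties: Dict[str, object] = field(default_factory=dict)


def add_property(entries: List[MethodEntry], name: str,
                 compute: Callable[[MethodEntry], object]) -> None:
    """Attach a newly computed property to every entry, in place."""
    for entry in entries:
        entry.properties[name] = compute(entry)


# Toy data (assumed): a single method with a token representation.
entries = [
    MethodEntry(
        method_id="m1",
        representations={
            "tokens": ["int", "add", "(", "int", "a", ",", "int", "b", ")",
                       "{", "return", "a", "+", "b", ";", "}"],
        },
    )
]

# Extend the dataset with a property derived from an existing representation.
add_property(entries, "num_tokens", lambda e: len(e.representations["tokens"]))
```

Keeping representations and properties in separate keyed maps means a new property can be computed from any existing representation without changing the record type.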
Taxonomy
Topics: Software Engineering Research · Advanced Malware Detection Techniques · Software System Performance and Reliability
