Megadiff: A Dataset of 600k Java Source Code Changes Categorized by Diff Size
Martin Monperrus, Matias Martinez, He Ye, Fernanda Madeiral, Thomas Durieux, and Zhongxing Yu

TL;DR
Megadiff is a large, carefully curated dataset of 663,029 Java code diffs designed to support research in code change analysis, fault localization, and machine learning applications involving source code modifications.
Contribution
The paper introduces Megadiff, a large dataset of Java diffs selected under strict inclusion criteria on commit message and diff size, to support research in commit comprehension and automated program repair.
Findings
Contains 663,029 Java diffs suitable for machine learning.
Supports research in commit comprehension and fault localization.
Facilitates development of automated program repair tools.
Abstract
This paper presents Megadiff, a dataset of source code diffs. It focuses on Java, with strict inclusion criteria based on commit message and diff size. Megadiff contains 663,029 Java diffs that can be used for research on commit comprehension, fault localization, automated program repair, and machine learning on code changes.
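To make "categorized by diff size" concrete, here is a minimal Python sketch of one plausible way to measure a diff and group diffs by that measure. It is not the authors' tooling: the size definition (added plus removed lines in a unified diff) and the names `diff_size` and `group_by_size` are assumptions made for illustration, and the commit-message criterion is not modeled because the abstract does not specify it.

```python
from collections import defaultdict

# Hypothetical example diff in unified format (not taken from Megadiff).
EXAMPLE = """\
--- a/Foo.java
+++ b/Foo.java
@@ -1,3 +1,3 @@
 class Foo {
-    int x = 0;
+    int x = 1;
 }
"""


def diff_size(unified_diff: str) -> int:
    """Count added plus removed lines in a unified diff.

    Assumption: size = number of '+' and '-' content lines. File header
    lines ('---' / '+++') are skipped; as a known limitation, a changed
    line whose own content begins with '--' or '++' is skipped too.
    """
    size = 0
    for line in unified_diff.splitlines():
        if line.startswith(("+++", "---")):
            continue  # file header, not a content change
        if line.startswith(("+", "-")):
            size += 1
    return size


def group_by_size(diffs):
    """Group diffs into a dict keyed by their size in changed lines."""
    groups = defaultdict(list)
    for d in diffs:
        groups[diff_size(d)].append(d)
    return dict(groups)
```

Calling `diff_size(EXAMPLE)` counts one removed and one added line, and `group_by_size` places every diff with the same count under the same key, which is the kind of lookup a size-categorized dataset makes cheap.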
Taxonomy
Topics: Software Engineering Research · Software Testing and Debugging Techniques · Advanced Malware Detection Techniques
