Methods2Test: A dataset of focal methods mapped to test cases
Michele Tufano, Shao Kun Deng, Neel Sundaresan, Alexey Svyatkovskiy

TL;DR
Methods2Test is a large, publicly available dataset linking Java methods to their corresponding unit tests, designed to support machine learning research in automated test generation and software testing analysis.
Contribution
The paper introduces Methods2Test, a dataset of 780,944 method-test pairs with rich metadata, intended to support machine learning models for automated unit test generation.
Findings
Created a dataset with 780,944 method-test pairs from 91,385 Java projects.
Developed heuristics to reliably map test cases to focal methods.
Provided textual data at multiple context levels for ML training and evaluation.
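This summary does not spell out the mapping heuristics, so the following is only a minimal sketch of one plausible name-based approach, not the authors' implementation. It assumes a test class is linked to a focal class by stripping a `Test`/`Tests` affix, and a test method is linked to a focal method by dropping a leading `test`. The function names `focal_class_name` and `focal_method_name` are hypothetical.

```python
import re
from typing import Optional, Sequence


def focal_class_name(test_class: str) -> Optional[str]:
    """Guess the focal class by stripping a 'Test'/'Tests' affix (assumed convention)."""
    # Try suffixes first, longest first, so 'FooTests' yields 'Foo' and not 'FooTests'[:-4].
    for suffix in ("Tests", "Test"):
        if test_class.endswith(suffix) and len(test_class) > len(suffix):
            return test_class[: -len(suffix)]
    # Fall back to a 'Test' prefix, e.g. 'TestFoo'.
    if test_class.startswith("Test") and len(test_class) > len("Test"):
        return test_class[len("Test"):]
    return None  # no recognizable naming convention


def focal_method_name(test_method: str, candidates: Sequence[str]) -> Optional[str]:
    """Guess the focal method among the focal class's method names.

    Returns a name only when exactly one candidate matches, so ambiguous
    cases (e.g. overloaded methods) are left unmapped rather than guessed.
    """
    stripped = re.sub(r"^test_?", "", test_method, flags=re.IGNORECASE)
    if not stripped:
        return None
    matches = [c for c in candidates if c.lower() == stripped.lower()]
    return matches[0] if len(matches) == 1 else None
```

For example, `focal_class_name("CalculatorTest")` gives `"Calculator"`, and `focal_method_name("testAdd", ["add", "subtract"])` gives `"add"`. Returning `None` on ambiguity trades coverage for precision, which seems to fit a dataset meant to provide reliable supervision.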
Abstract
Unit testing is an essential part of the software development process, which helps to identify issues with source code in early stages of development and prevent regressions. Machine learning has emerged as a viable approach to help software developers generate automated unit tests. However, using machine learning to generate reliable unit test cases that are semantically correct and capable of catching software bugs or unintended behavior requires large, metadata-rich datasets. In this paper we present Methods2Test, a large, supervised dataset of test cases mapped to their corresponding methods under test (i.e., focal methods). This dataset contains 780,944 pairs of JUnit tests and focal methods, extracted from a total of 91,385 Java open source projects hosted on GitHub with licenses permitting re-distribution. The main challenge behind the…
