pyMethods2Test: A Dataset of Python Tests Mapped to Focal Methods

Idriss Abdelmadjid; Robert Dyer

arXiv:2502.05143·cs.SE·February 10, 2025

TL;DR

This paper introduces pyMethods2Test, a large dataset of Python unit tests mapped to the focal methods they exercise, intended to support training and evaluating language models that generate unit tests.

Contribution

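The paper presents a large-scale dataset of Python tests with explicit mappings to focal methods, built by applying a set of heuristics to open-source GitHub projects that use Pytest or unittest. According to the abstract, comparable datasets exist for languages like Java, but none were publicly available for Python.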

Findings

01. Analyzed over 88K GitHub projects with Python tests

02. Extracted over 22 million test methods and 2 million method mappings

03. Provides a publicly available dataset for training and evaluating code testing models

Abstract

Python is one of the fastest-growing programming languages and currently ranks as the top language in many lists, even recently overtaking JavaScript as the top language on GitHub. Given its importance in data science and machine learning, it is imperative to be able to effectively train LLMs to generate good unit test cases for Python code. This motivates the need for a large dataset to provide training and testing data. To date, while other large datasets exist for languages like Java, none publicly exist for Python. Python poses difficult challenges in generating such a dataset, due to its less rigid naming requirements. In this work, we consider two commonly used Python unit testing frameworks: Pytest and unittest. We analyze a large corpus of over 88K open-source GitHub projects utilizing these testing frameworks. Using a carefully designed set of heuristics, we are able to locate…
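The abstract is cut off before it describes the heuristics, so the following is only an illustrative sketch of what a name-based test-to-focal-method mapping could look like, not the authors' method. The function names `find_test_functions` and `guess_focal_method`, the `test` prefix rule, and the sample source are all assumptions made for this example; it uses only Python's standard `ast` module.

```python
import ast
from typing import Optional

# Hypothetical sample file mixing a Pytest-style function and a unittest-style class.
SAMPLE = '''
import unittest

def test_parse_date_handles_empty():
    assert True

class TestStack(unittest.TestCase):
    def test_push(self):
        pass

    def helper(self):
        pass
'''


def find_test_functions(source: str) -> list:
    """Return names of all functions or methods whose name starts with 'test'.

    Assumption: a 'test' name prefix marks a test, which covers the default
    conventions of both Pytest and unittest but not custom configurations.
    """
    tree = ast.parse(source)
    return [
        node.name
        for node in ast.walk(tree)
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
        and node.name.startswith("test")
    ]


def guess_focal_method(test_name: str, candidates: set) -> Optional[str]:
    """Naive heuristic: strip the 'test' prefix, then pick the longest
    candidate name that equals the remainder or is followed by '_' in it."""
    stem = test_name[len("test"):].lstrip("_").lower()
    matches = [
        c for c in candidates
        if stem == c.lower() or stem.startswith(c.lower() + "_")
    ]
    return max(matches, key=len, default=None)


if __name__ == "__main__":
    focal_candidates = {"parse", "parse_date", "push"}  # hypothetical focal methods
    for name in find_test_functions(SAMPLE):
        print(name, "->", guess_focal_method(name, focal_candidates))
```

A purely name-based rule like this breaks down quickly in Python, where tests are not required to follow the name of the code they exercise; that looseness is the difficulty the abstract points to when it mentions less rigid naming requirements.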

Taxonomy

Topics: Computational Physics and Python Applications