DS-1000: A Natural and Reliable Benchmark for Data Science Code Generation

Yuhang Lai; Chengxi Li; Yiming Wang; Tianyi Zhang; Ruiqi Zhong; Luke Zettlemoyer; Scott Wen-tau Yih; Daniel Fried; Sida Wang; Tao Yu

arXiv:2211.11501·cs.SE·November 22, 2022·31 cites

TL;DR

DS-1000 is a benchmark of a thousand data science code generation problems collected from StackOverflow. It evaluates solutions with both test cases and surface-form constraints, and it modifies the original problems so that models cannot succeed by memorization alone.

Contribution

We introduce DS-1000, a benchmark of a thousand data science problems spanning seven Python libraries, with realistic problems collected from StackOverflow, multi-criteria automatic evaluation, and problem modifications that defend against memorization.

Findings

01

Current best model (Codex-002) achieves 43.3% accuracy.

02

Our automatic evaluation is highly specific: only 1.8% of the Codex-002 solutions it accepts are incorrect.

03

DS-1000 covers diverse, practical data science problems from StackOverflow.

Abstract

We introduce DS-1000, a code generation benchmark with a thousand data science problems spanning seven Python libraries, such as NumPy and Pandas. Compared to prior works, DS-1000 incorporates three core features. First, our problems reflect diverse, realistic, and practical use cases since we collected them from StackOverflow. Second, our automatic evaluation is highly specific (reliable) -- across all Codex-002-predicted solutions that our evaluation accepts, only 1.8% of them are incorrect; we achieve this with multi-criteria metrics, checking both functional correctness by running test cases and surface-form constraints by restricting API usages or keywords. Finally, we proactively defend against memorization by slightly modifying our problems to be different from the original StackOverflow source; consequently, models cannot answer them correctly by memorizing the solutions from…
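The multi-criteria evaluation described in the abstract can be illustrated with a minimal sketch. This is not the DS-1000 harness: the function names (`uses_banned_keyword`, `passes_functional_test`, `accept`), the convention that a candidate reads its input from `x` and stores its answer in `result`, and the banned keyword set are all assumptions made for illustration. The sketch accepts a candidate only if it produces the expected output and avoids the restricted keywords.

```python
import io
import tokenize


def uses_banned_keyword(code, banned):
    """Return True if any banned name or keyword appears as a token in the code."""
    tokens = tokenize.generate_tokens(io.StringIO(code).readline)
    return any(tok.type == tokenize.NAME and tok.string in banned for tok in tokens)


def passes_functional_test(code, test_input, expected):
    """Run the candidate with input `x` and compare its `result` variable to `expected`."""
    namespace = {"x": test_input}  # hypothetical convention: input is exposed as `x`
    try:
        # exec is unsafe for untrusted code; a real harness would sandbox this step.
        exec(code, namespace)
    except Exception:
        return False
    return namespace.get("result") == expected


def accept(code, test_input, expected, banned):
    """Accept only if the candidate is functionally correct AND meets the surface-form constraint."""
    return passes_functional_test(code, test_input, expected) and not uses_banned_keyword(code, banned)


# Hypothetical problem: sum a list without writing an explicit loop.
BANNED = {"for", "while"}
no_loop = "result = sum(x)"
with_loop = "result = 0\nfor v in x:\n    result += v"
wrong = "result = max(x)"

print(accept(no_loop, [1, 2, 3], 6, BANNED))    # correct output, no banned keyword
print(accept(with_loop, [1, 2, 3], 6, BANNED))  # correct output, but uses `for`
print(accept(wrong, [1, 2, 3], 6, BANNED))      # wrong output
```

Only the first candidate is accepted. The second shows why a test case alone is not enough when a problem asks for a particular style of solution: the loop version returns the right value but violates the keyword restriction.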

Code & Models

Datasets

mteb/DS1000Retrieval
dataset · 14 dl

Taxonomy

Topics: Software Engineering Research · Machine Learning and Data Classification · Machine Learning in Materials Science

Methods: Test