DS-1000: A Natural and Reliable Benchmark for Data Science Code Generation
Yuhang Lai, Chengxi Li, Yiming Wang, Tianyi Zhang, Ruiqi Zhong, Luke Zettlemoyer, Scott Wen-tau Yih, Daniel Fried, Sida Wang, Tao Yu

TL;DR
DS-1000 is a new, reliable benchmark for data science code generation. It features diverse real-world problems and precise evaluation, and it measures model generalization beyond memorization.
Contribution
We created DS-1000, a comprehensive benchmark with realistic problems, multi-criteria evaluation, and methods to prevent memorization, advancing data science code generation research.
Findings
Current best model (Codex-002) achieves 43.3% accuracy.
Our evaluation method is highly reliable: among the Codex-002-predicted solutions it accepts, only 1.8% are incorrect.
DS-1000 covers diverse, practical data science problems from StackOverflow.
Abstract
We introduce DS-1000, a code generation benchmark with a thousand data science problems spanning seven Python libraries, such as NumPy and Pandas. Compared to prior work, DS-1000 incorporates three core features. First, our problems reflect diverse, realistic, and practical use cases since we collected them from StackOverflow. Second, our automatic evaluation is highly specific (reliable): across all Codex-002-predicted solutions that our evaluation accepts, only 1.8% are incorrect. We achieve this with multi-criteria metrics, checking both functional correctness by running test cases and surface-form constraints by restricting API usage or keywords. Finally, we proactively defend against memorization by slightly modifying our problems to be different from the original StackOverflow source; consequently, models cannot answer them correctly by memorizing the solutions from…
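The multi-criteria idea described in the abstract can be illustrated with a minimal sketch. This is not the DS-1000 evaluation harness: the function names (`passes_tests`, `uses_forbidden_syntax`, `accept`), the convention that a candidate stores its answer in a variable called `result`, and the "no explicit loops" constraint are all assumptions made for illustration. The example uses only the standard library so that it stays self-contained, whereas the benchmark itself targets libraries such as NumPy and Pandas.

```python
import ast

def passes_tests(code, test_cases):
    """Functional check: run the candidate on each input and compare `result`."""
    for inputs, expected in test_cases:
        namespace = dict(inputs)  # fresh namespace per test case
        try:
            exec(code, namespace)
        except Exception:
            return False  # any runtime or syntax error counts as a failure
        if namespace.get("result") != expected:
            return False
    return True

def uses_forbidden_syntax(code, forbidden_nodes=(ast.For, ast.While)):
    """Surface-form check: True if the code contains a forbidden AST node type."""
    tree = ast.parse(code)
    return any(isinstance(node, forbidden_nodes) for node in ast.walk(tree))

def accept(code, test_cases):
    """Accept only if the code is functionally correct AND obeys the constraint."""
    return passes_tests(code, test_cases) and not uses_forbidden_syntax(code)

# Hypothetical problem: compute a running sum without writing an explicit loop.
test_cases = [({"a": [1, 2, 3]}, [1, 3, 6])]

vectorized = "import itertools\nresult = list(itertools.accumulate(a))"
looped = (
    "result = []\n"
    "total = 0\n"
    "for x in a:\n"
    "    total += x\n"
    "    result.append(total)\n"
)

print(accept(vectorized, test_cases))  # correct output, no explicit loop
print(accept(looped, test_cases))      # correct output, but violates the constraint
```

Both candidates produce the right output, so a test-only metric would accept both. The surface-form check is what rejects the looped version, which is the kind of false acceptance the abstract says the multi-criteria design is meant to reduce.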
Taxonomy
Topics: Software Engineering Research · Machine Learning and Data Classification · Machine Learning in Materials Science
Methods: Test
