SciCode: A Research Coding Benchmark Curated by Scientists

Minyang Tian; Luyu Gao; Shizhuo Dylan Zhang; Xinan Chen; Cunwei Fan; Xuefei Guo; Roland Haas; Pan Ji; Kittithat Krongchon; Yao Li; Shengyan Liu; Di Luo; Yutao Ma; Hao Tong; Kha Trinh; Chenyu Tian; Zihan Wang; Bohao Wu; Yanyu Xiong; Shengzhu Yin; Minhui Zhu; Kilian Lieret; Yanxin Lu; Genglin Liu; Yufeng Du; Tianhua Tao; Ofir Press; Jamie Callan; Eliu Huerta; Hao Peng

arXiv:2407.13168·cs.AI·July 19, 2024·3 cites


Open Access · 2 Datasets · 1 Video

TL;DR

SciCode is a scientist-curated benchmark of 338 subproblems, decomposed from 80 main scientific research coding problems across 16 natural science sub-fields. It evaluates language models' ability to generate code for real research tasks and exposes their current limitations.

Contribution

This paper introduces SciCode, a scientist-curated benchmark for evaluating language models on real scientific research coding problems across 16 natural science sub-fields.

Findings

01. Claude3.5-Sonnet solves only 4.6% of problems in realistic settings

02. SciCode highlights current AI limitations in scientific problem-solving

03. Benchmark facilitates future development of scientific AI tools

Abstract

Since language models (LMs) now outperform average humans on many challenging tasks, it has become increasingly difficult to develop challenging, high-quality, and realistic evaluations. We address this issue by examining LMs' capabilities to generate code for solving real scientific research problems. Incorporating input from scientists and AI researchers in 16 diverse natural science sub-fields, including mathematics, physics, chemistry, biology, and materials science, we created a scientist-curated coding benchmark, SciCode. The problems in SciCode naturally factorize into multiple subproblems, each involving knowledge recall, reasoning, and code synthesis. In total, SciCode contains 338 subproblems decomposed from 80 challenging main problems. It offers optional descriptions specifying useful scientific background information and scientist-annotated gold-standard solutions and test…
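The decomposition described in the abstract can be pictured with a small Python sketch. This is only an illustration, not the benchmark's actual schema or scoring code: the class names, field names, and helper functions are hypothetical, and the rule that a main problem counts as solved only when every one of its subproblems passes is an assumption made for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class SubProblem:
    """One step of a main problem. All field names here are hypothetical."""
    prompt: str
    test: Callable[[str], bool]       # returns True when a candidate solution passes
    background: Optional[str] = None  # optional scientific background description


@dataclass
class MainProblem:
    """A main problem that factorizes into several subproblems."""
    name: str
    subproblems: List[SubProblem] = field(default_factory=list)


def subproblem_results(problem: MainProblem, solutions: List[str]) -> List[bool]:
    """Run each subproblem's test on the matching candidate solution."""
    return [sp.test(code) for sp, code in zip(problem.subproblems, solutions)]


def main_problem_solved(problem: MainProblem, solutions: List[str]) -> bool:
    """Assumed rule: a main problem is solved only if every subproblem passes."""
    results = subproblem_results(problem, solutions)
    return len(results) == len(problem.subproblems) and all(results)


# Toy usage: the "test" just checks that the candidate code contains a return.
toy = MainProblem(
    name="toy problem",
    subproblems=[
        SubProblem(prompt="step 1", test=lambda code: "return" in code),
        SubProblem(prompt="step 2", test=lambda code: "return" in code),
    ],
)
print(subproblem_results(toy, ["return 1", "pass"]))   # per-subproblem outcomes
print(main_problem_solved(toy, ["return 1", "pass"]))  # whole-problem outcome
```

With 338 subproblems spread over 80 main problems, a main problem has a little over four subproblems on average, so under the all-must-pass assumption a single failed step is enough to fail the whole problem.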



Videos

OpenAI: ‘We Just Reached Human-level Reasoning’. · YouTube

Taxonomy

Topics: Scientific Computing and Data Management · Research Data Management Practices · Genetics, Bioinformatics, and Biomedical Research