GPT-4.1 Sets the Standard in Automated Experiment Design Using Novel Python Libraries

Nuno Fachada; Daniel Fernandes; Carlos M. Fernandes; Bruno D. Ferreira-Saraiva; João P. Matos-Carvalho

arXiv:2508.00033·cs.SE·September 17, 2025

TL;DR

This paper benchmarks several large language models on automated Python code generation for scientific experiments. GPT-4.1 performed best, and the results highlight current limitations in LLM-based scientific automation.

Contribution

The paper systematically evaluates how well LLMs generate functional Python code for complex scientific tasks, introduces a benchmarking methodology for this setting, and reports that GPT-4.1 was the most successful model.

Findings

01

GPT-4.1 achieved a 100% success rate in the code generation tasks.

02

Most models succeeded in fewer than half of the runs.

03

The error analysis identified shortcomings in third-party libraries that affected code execution.

Abstract

Large Language Models (LLMs) have advanced rapidly as tools for automating code generation in scientific research, yet their ability to interpret and use unfamiliar Python APIs for complex computational experiments remains poorly characterized. This study systematically benchmarks a selection of state-of-the-art LLMs in generating functional Python code for two increasingly challenging scenarios: conversational data analysis with the ParShift library, and synthetic data generation and clustering using pyclugen and scikit-learn. Both experiments use structured, zero-shot prompts specifying detailed requirements but omitting in-context examples. Model outputs are evaluated quantitatively for functional correctness and prompt compliance over multiple runs, and qualitatively by analyzing the errors produced when code execution fails. Results show that only a small…
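The abstract describes scoring model outputs for functional correctness over multiple runs. As a rough illustration only, and not the authors' harness, the idea of a per-model success rate might be sketched as below. The helper names `run_script` and `success_rate`, the 30-second timeout, and the toy candidate scripts are all assumptions made for this example; the sketch checks only whether a script exits cleanly and does not cover prompt compliance.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_script(source: str, timeout: float = 30.0) -> bool:
    """Return True if the script exits with status 0 within the timeout.

    Hypothetical helper: it only checks that the code runs without error,
    not that it satisfies the prompt's requirements.
    """
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "candidate.py"
        script.write_text(source, encoding="utf-8")
        try:
            result = subprocess.run(
                [sys.executable, str(script)],
                capture_output=True,
                timeout=timeout,
                cwd=tmp,
            )
        except subprocess.TimeoutExpired:
            return False
        return result.returncode == 0


def success_rate(candidates: list[str]) -> float:
    """Fraction of candidate scripts that execute successfully."""
    if not candidates:
        raise ValueError("no candidates to evaluate")
    return sum(run_script(src) for src in candidates) / len(candidates)


# Toy stand-ins for generated scripts (assumed, not taken from the paper).
candidates = [
    "print(sum(range(10)))",          # runs cleanly
    "import nonexistent_module_xyz",  # fails: missing module
    "raise SystemExit(1)",            # fails: non-zero exit status
    "x = [1, 2, 3]\nprint(len(x))",   # runs cleanly
]
rate = success_rate(candidates)
print(f"success rate: {rate:.0%}")
```

Running each candidate in a fresh subprocess and temporary directory keeps one failing script from affecting the next, which matters when the same prompt is sampled many times.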
