From articles to code: on-demand generation of core algorithms from scientific publications
Cameron S Movassaghi, Amanda Momenzadeh, Jesse G Meyer

TL;DR
This paper explores using AI to generate code from scientific papers, showing it can work well when papers are detailed enough.
Contribution
The study introduces a benchmark for evaluating code generation from scientific publications using large language models.
Findings
LLMs can reproduce package-level functionality from papers with performance matching established libraries.
Failures occurred when papers lacked sufficient implementation details or data structure specifications.
The results highlight where publication standards currently hinder reproducibility in computational methods.
Abstract
Scientific software packages impose persistent maintenance costs due to dependency churn, version incompatibilities, and bug triage, even when the underlying algorithms are stable and well described. At the same time, peer-reviewed publications already function as the canonical record of many computational methods, yet translating narrative method descriptions into usable code remains labor-intensive and error-prone. Recent advances in large language models (LLMs) raise the question of whether published articles alone can serve as sufficient specifications for on-demand code generation, potentially reducing reliance on continuously maintained libraries. We systematically evaluated state-of-the-art LLMs by tasking them with implementing core algorithms using only the original scientific publications as input. Across a diverse benchmark including random forests, batch correction methods,…
Genes, proteins, chemicals, diseases, species, mutations and cell lines named across the full text — each resolved to its canonical identifier and authoritative record.
Click any figure to enlarge with its caption.
Figure 1
Figure 2
Figure 3
Figure 4
Figure 5
Figure 6| Any edge | Positive edge metrics | Negative edge metrics | |
|---|---|---|---|
| Jaccard | 1 | 0.331 | 0.297 |
| Precision | 1 | 0.508 | 0.448 |
| Recall | 1 | 0.487 | 0.469 |
| F1 score | 1 | 0.497 | 0.458 |
| Random Forest | Combat | Augusta | SERRF | GSEA | |
|---|---|---|---|---|---|
| Claude 4 Sonnet | N | Y | N/A | N/A | N/A |
| GPT-o4-mini-high | Y | Y | Y | Y | Y |
| Gemini Pro 2.5 | N | Y | N/A | N | N/A |
| GPT-4o | NA | N | N/A | N/A | N |
| Gemini Flash 2.5 | NA | N | N/A | N | N |
- —National Institute of Health through the National Institute for General Medical Sciences (NIGMS)
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsScientific Computing and Data Management · Software Engineering Research · Machine Learning in Materials Science
1 Introduction
Scientific articles serve as the foundation of reproducible computational research, providing detailed, peer-reviewed descriptions of novel algorithms and methodologies. However, the journey from paper to robust, usable software is fraught with challenges. Translating narrative algorithmic insights into production-grade implementations remains a labor-intensive process, frequently hindered by ambiguities or omitted practical details in published work. This gap is widely recognized as a core barrier to scientific transparency and the credibility of computational findings, with reproducibility crises and failed software reimplementations demonstrating the difficulties researchers face in reusing published work (Miłkowski et al. 2018, Ziemann et al. 2023).
To bridge this divide, software libraries provide powerful abstractions, encapsulating complex scientific and statistical methods behind user-friendly APIs. It is well accepted that code libraries can enable reproducible research (Peng 2011, Ziemann et al. 2023). While these libraries accelerate research, their maintenance is a monumental task. Key challenges include:
Prominent and deeply interwoven dependency chains, where updates in foundational packages propagate breaking changes throughout the ecosystem, creating a delicate balance between innovation and stability (Tsakpinis 2023);Subtle edge-case bugs that require targeted, labor-intensive fixes, often only identified through extensive community usage and evolving scientific requirements;Version mismatches and a lack of standardized compute environments undermine the reproducibility of scientific research.
Recent advances in large language models (LLMs) for code generation, such as OpenAI Codex (Chen et al. 2021) and DeepMind AlphaCode (Li et al. 2022), mark a paradigm shift in how computational methods can be instantiated. These models, trained on billions of lines of code and natural language, can translate natural-language problem descriptions into executable software. In several benchmarks, these models not only accelerate development but also demonstrate competitive performance in real-world tasks, with capabilities advancing rapidly across programming languages and domains. However, success rates vary: while straightforward problems are effectively solved, multi-step scientific workflows and nuanced research tasks remain a challenge for current LLMs (Chen et al. 2021, Li et al. 2022).
When paired with retrieval-augmented generation (RAG) frameworks (Lewis et al. 2021), LLMs can fetch precise algorithmic details or API documentation from curated corpora, including scientific articles and official documentation, at inference time. This approach enables a model to synthesize code directly guided by original research, rather than relying solely on historical training data. By dynamically integrating this external knowledge just-in-time, the distinction between published algorithm and runnable implementation begins to fade, hinting at a possible future in which articles operate as executable specifications in certain contexts.
In this study, we systematically probe the limits of LLM-driven code synthesis using exclusively literature-sourced descriptions. We present a comprehensive benchmark, tasking several state-of-the-art models with reimplementing core methods from popular Python libraries, guided only by a prompt and the original publication’s text (see Fig. 1). Outputs are rigorously compared against package-standard behaviors across diverse test suites. These results highlight where on-demand, literature-driven generation can succeed, where current models fall short, and assess how close we are to a future in which flexible, article-driven implementations supersede traditional package maintenance. Because our goal is to examine feasibility rather than exhaustively survey all algorithm classes, we frame this study as a proof of concept centered on a representative set of diverse algorithms.
Concept overview. Prompts were passed to various models requesting that methods be created based only on the academic manuscript describing the algorithm. The on-demand generated Python code was then evaluated with regard to whether it could successfully run at all, and then qualitative and quantitative metrics of performance relative to the intended goal were computed to compare various versions of each method.
2 Materials and methods
2.1 Code and data availability
The code, data, and PDF files used in this paper are available from https://github.com/xomicsdatascience/articles-to-code.
2.2 Random Forest
The original PDF of the paper (Breiman 2001) was attached, and the prompt was: “Forget all our previous chats and forget everything you know about the random forest method for machine learning. Using only the information in the attached PDF, try your hardest to create an implementation of a random forest that exactly matches the algorithm for classification. Your function should work with the iris dataset from sklearn to classify the flower type. You only get one try to do this correctly, so please try very carefully to get the code correct.” The generated code is in RF-compare-models.ipynb.
2.3 Combat
The original PDF paper (Johnson et al. 2007) was attached, and the first prompt was: “Forget all our previous chats and forget everything you know about the batch correction method called combat. Using only the information in the attached PDF, try your best to create an implementation of combat that exactly matches the math described to correct omics data across batches. It should take an input of a dataframe where columns represent samples and rows represent genes, along with a list of batches for each sample (in the same order as the columns of the dataframe), and then return the corrected dataframe. You only get one try to do this correctly, so please try very carefully to get the code correct.” The generated code, including the comparative analysis, is available in combat-compare-models.ipynb.
2.4 Augusta
The PDF of the paper (Musilova et al. 2024) was attached along with the CSV with data from the relevant Dialogue on Reverse Engineering Assessment and Methods (DREAM) challenge (Meyer and Saez-Rodriguez 2021) (available from their GitHub repository). The following prompt was used: “Forget anything you know about the method called Augusta, and ignore any previous chats about that method. Use the PDF paper I uploaded and implement their method for gene regulatory network discovery from scratch in Python. Implement the version without requiring motifs from a genome. Use the timeseries of gene expression in the Ecoli_DREAM4.csv file to apply that algorithm to detect a GRN. Your code should produce a GRN as a matrix of gene x gene, where each cell gives a 1 if a positive relation or -1 if a negative relation. Run the code yourself and solve any errors that come up until it works.” The generated code, including the comparison to their pip package, is available on GitHub in augusta.ipynb.
2.5 Systematic error removal by random Forest (SERRF)
Within the Google Collab environment, a PDF of the paper (Fan et al. 2019) was attached along with an Excel file with example data. The following prompt was used: “Forget all our previous chats and forget everything you know about the batch correction method called systematic error removal using random forest (SERRF). Using only the information in the attached PDF, try your best to create an implementation of SERRF that exactly matches the math described to correct omics data across batches. It should take a pandas DataFrame as input and return a pandas DataFrame containing the batch-corrected omics data. Test the implementation using the provided example dataset. You only get one try to do this correctly so please try very carefully to get the code correct.” The resulting code is available on GitHub in three separate ipynb files: “SERFF from repo url.ipynb,” “SERRF from gitingest.ipynb,” and “SERRF from pdf.ipynb.”
A similar prompt was then attempted with the PDF and the example data using gpt-o4-mini-high: “Forget all our previous chats and forget everything you know about the batch correction method called systematic error removal using random forest (SERRF). Using only the information in the attached PDF, try your best to create an implementation of SERRF that exactly matches the math described to correct omics data across batches. It should take a pandas DataFrame as input and return a pandas DataFrame containing the batch-corrected omics data. Test the implementation using the provided example dataset. You only get one try to do this correctly so please try very carefully to get the code correct. Iteratively try to solve any errors that pop up and run the code yourself to test it until it works.”
After that failed, it became clear that the multi-index structure of the example dataset without instruction was the problem. A second prompt was tried: “Forget all our previous chats and forget everything you know about the batch correction method called systematic error removal using random forest (SERRF). Using only the information in the attached PDF, try your best to create an implementation of SERRF that exactly matches the math described to correct omics data across batches. It should take a pandas DataFrame as input and return a pandas DataFrame containing the batch-corrected omics data. Test the implementation using the provided example dataset. You only get one try to do this correctly so please try very carefully to get the code correct. Iteratively try to solve any errors that pop up and run the code yourself to test it until it works. The data structure is in the attached picture. There are multiple columns labels to identify what each row is. The first row specifies the batch, the second specifies the sample type, where specifically you will need the qc columns for the model fitting, and the third specifies the time or injection order. Then the fourth gives ‘label’ which is the sample naming, and then all subsequent rows give the actual measured values for analytes. The analyte names are given in the second column and each given a number in the first column. The real data measurements start in the third column and the fifth row. Here is what that looks like for the top left part of the data in the attached picture.” The picture was a screenshot of the top left part of the excel file showing the data structure. The resulting code is on GitHub as “serrf-gpt.ipynb.”
2.6 Gene set enrichment analysis (GSEA)
We evaluated three LLMs, Gemini 2.5 Flash, GPT-4o, and GPT-o4-mini-high, on their ability to reproduce the preranked GSEA workflow described by Fang et al. in the GSEApy package (Fang et al. 2023). We used mass spectrometry data from Wang et al. (2024), which profiled 46 migratory and 43 non-migratory single HeLa cells. Data preprocessing was performed to replicate the original study’s Methods, and differential expression analysis was conducted between the most clearly defined migratory and fixed clusters using a two-sided Wilcoxon rank-sum test with Benjamini-Hochberg correction. Genes with an adjusted p-value < 0.05 were retained (total 1,951 proteins) and ranked by log_2_ fold-change to generate the input file for GSEA.
Each model was prompted with the following instructions, along with the GSEApy paper (Fang et al. 2023) and the ranked gene list: “Forget everything you know about gene set enrichment analysis. Using only the information contained in the attached paper GSEApy: a comprehensive package for performing gene set enrichment analysis in Python by Fang et al., and the attached ranked gene list file that I created, please perform gene set enrichment analysis using the KEGG_2021_Human gene set library from scratch. The ranked gene list file includes gene names and associated log fold change values from a differential expression analysis. Based solely on the methods outlined in the paper, extract the 15 pathways with the smallest FDR q values, and generate a publication-ready bar plot showing the top 15 enriched KEGG pathways sorted by -log10(FDR q value). Please adhere strictly to the approach as described in the paper and avoid using any prior knowledge or libraries beyond what is required per the attached paper.”
Gemini 2.5 Flash declined to perform the analysis, citing the inability to execute the code from scratch. GPT-4o reported that the study was too computationally intensive and recommended running the GSEApy package locally. GPT-4o-mini-high initially returned a script that relied on the GSEApy package; however, when we insisted on a “from-scratch” implementation, it provided a standalone Python script that reproduces the enrichment algorithm (Subramanian et al. 2005) as re-implemented by Fang et al. It stated, “This script only uses pandas, numpy, random, and matplotlib.” We then supplied GPT-o4-mini-high with the ranked gene and KEGG_2021_Human.gmt files. The script produced a bar plot first using a permutation of 20 for speed and noted that higher precision would require an increasing number of permutations to 200 or 1,000. We asked GPT-o4-mini-high to use n_perm = 1,000 and a random seed of 42. Separately, for a matching manual implementation, we used GSEApy’s own gp.prerank() function with the following parameters: permutation_num = 1,000, min_size = 15, max_size = 500, and seed = 42.
2.7 Tool usage
Unless otherwise specified, models were not instructed to search the web for additional information, and they did not report using web searches; however, we cannot be certain that the models did not conduct web searches.
2.8 LLM configuration and uncontrolled parameters
Except for the experiment to recreate SERRF, which was done in the Google Collab interface with Gemini, all models were accessed through their respective chat interfaces rather than programmatic APIs. As a result, several internal inference parameters were not visible to the user and could not be modified, including temperature, top-p, sampling seed, and context-window allocation. The parameters we were able to control and therefore report are included in the methods narrative above. Whenever possible, we report:
the specific model’s name selected for each task,the exact prompt text (included in each section above),models were allowed to iteratively attempt self-debugging (for the models that were capable of doing so)the number of overall attempts per task was not documented explicitly, but generally, less than ten attempts were made for any task.
All experiments used the default stochastic settings of the chat interfaces. Because those settings are not exposed to the user, we cannot report explicit numeric values for temperature, seeds, or context limits, although we note that all tested models have context limits that likely far exceed the prompts provided. We note this limitation for transparency, but it does not alter the empirical results.
2.9 LLM writing assistance
Various LLMs were used to help with writing this manuscript. The authors read and edited the generated text, and they are responsible for the contents.
3 Results
3.1 Random Forest
We started with a relatively simple algorithm, the random forest, initially described by Leo Breiman in 2001 (Breiman 2001). We tested Claude 4 Sonnet, which was unable to accept context as long as the PDF and the prompt. We also tested Gemini Pro, which produced an error related to an edge case where there was no information gain after splitting. After pasting the mistake into the chat session, Gemini Pro 2.5 was able to fix the code and produce a working random forest from scratch. Thus, Gemini Pro was considered successful at this task, given that a slightly more complex agent system would likely be able to fix it on its first try. Alternatively, if we had asked it to run the code itself and fix errors in the original prompt, it likely would have been successful. Third, we tested GPT-o4-mini-high with the same prompt. The performance of the code derived from either LLM was effectively equivalent to the scikit-learn (Pedregosa et al. 2011) version (Fig. 2A).
Random forest results. (A) Random forests implemented by Gemini Pro 2.5, GPT-o4-mini-high, or the scikit-learn standard implementation were compared over 100 random splits of the iris data (from scikit-learn) to compute distributions of model performance of classification accuracy. (B) GPT-o4-mini-high was asked to create a regression version for the scikit-learn diabetes dataset, and 100 random splits were repeated to generate distributions of mean squared error (MSE) in predictions.
As a follow-up, GPT-o4-mini-high was asked to create a regressor version from scratch to analyze the scikit-learn diabetes dataset (Fig. 2B). The performance of the regressor was indistinguishable from that of scikit-learn in terms of mean squared error. Together, these results provide support for the notion that LLMs can generate code typically stored in packages on demand.
3.2 Combat
Motivated by its widespread use and more complex mathematical requirements, we set out to produce a custom version of Combat (Johnson et al. 2007), an empirical Bayes method for correcting batch effects in omics data. This concept remains widely used to this day. Combat was initially implemented in R, but more recently, a Python version was published (Behdenna et al. 2023). We presented multiple models a PDF copy of the original paper and a prompt asking to create the method (see Methods). We also had GPT-o4-mini-high write code to generate synthetic data with two batches and two classes, as well as code to perform various qualitative and quantitative assessments of the batch integration. We first confirmed that the data generator was effective by checking whether PyCombat could remove the batch effects and unify the classes. Only GPT-4o and Gemini 2.5 Flash failed to produce code that ran effectively on the first try. GPT-o4-mini-high, Gemini Pro 2.5, and Claude Sonnet 4 all produced code that worked on the first try. We qualitatively compared the performance of each model’s integration of batches and unification of classes, using a single random seed for synthetic data generation (Fig. 3). This indicates that, despite some minor variations in the results, all versions produced qualitatively similar data integration outcomes.
Qualitative comparison of combat implementations between LLMs. The top row shows the data in PCA space colored by batch across the five methods, and the bottom row shows the same data layout colored by class.
For a more rigorous comparison, we computed three metrics: (i) the average silhouette width (ASW), which measures the distances between clusters relative to the distances between samples in a cluster (values closer to one represent tight intra-group samples with distinct inter-group clusters), (ii) Local Inverse Simpson’s Index (LISI), which measures how diverse each sample’s local neighborhood is—higher values mean better mixing of different groups (e.g. batches or cell types); and (iii) classification accuracy based on 5-fold cross-validation using a logistic regression model, which tells us about whether simple models can learn generalizable differentiations of our data based on class or batch label. This analysis revealed that, in general, all four implementations of Combat were equivalent in terms of all three metrics and that any implementation was effective at batch correction (Fig. 4). Using the batch grouping, ASW was high only when the data was uncorrected (see top left of Fig. 3), but ASW increased from the class perspective only after batch correction with any implementation. LISI, from the perspective of batch labels, was only high after batch correction, and from the class perspective, it decreased slightly using any version of Combat. All the batch correction methods were sufficiently suitable to confuse the logistic regression classifier in distinguishing between batches, while the uncorrected data was perfectly classified (top right, Fig. 4). Importantly, although all three implementations imported other packages, such as Pandas and NumPy, when we asked GPT-o4-mini-high whether it could produce a version that relied only on the base Python, it was successful in doing so, using native Python data structures (“pure-python” in Fig. 4).
Quantification of combat equivalence between methods. Average silhouette width (ASW), Local Inverse Simpson’s Index (LISI), and the average accuracy from logistic regression (CVacc) were plotted for batch labels (top) and class labels (bottom). Random data was generated 100 times to compute a distribution of 100 metrics.
3.3 Augusta
Next, we tried to produce code from a package published in a 2024 paper. The algorithm, called Augusta (Musilova et al. 2024), uses mutual information and transcription factor binding motifs to infer gene regulatory networks from temporal gene expression studies. When given the data table from the Dream challenge #4 and the PDF of the paper and asked to troubleshoot any errors it encountered along the way (see prompt in Methods), GPT-o4-mini-high was able to produce an implementation that worked the first time. The inferred network produced connections that perfectly matched those in the version from the actual Python package; however, the signs of those connections differed substantially (Table 1). Repeated attempts produced essentially the same result, suggesting a reproducible but systematically different implementation. We then used https://gitingest.com/ to summarize the code in the Python package as a text file (excluding the data directory). We provided that GitHub repository summary to the model, asking what differences could account for the output differences. As expected, the model was able to identify the discrepancies and produce an implementation that matched the behavior of the released package with perfect metrics. The main differences are summarized by the model in the following three paragraphs. We emphasize that these differences do not reflect errors or deficiencies in the Augusta method or its authorship, but rather illustrate how reasonable implementation choices arise when translating concise manuscript descriptions into executable code, a situation that is common across computational biology methods.
While the core Augusta algorithm defines the number of expression bins, the manuscript text remains agnostic as to whether those bins are equal‐width in expression space or equal‐frequency across time points, a level of implementation detail that is commonly omitted from methods sections. In our reimplementation, we adopted equal-frequency discretization (each bin contains roughly the same number of samples). In contrast, the official code uses NumPy’s histogram2d to carve out fixed-width bins in value space for joint distributions. This subtle choice can materially affect the joint probability estimates—especially for genes whose expression ranges span vastly different scales—and it only became clear upon inspecting their GitHub source rather than from the PDF description alone.
For estimating mutual information (MI), the paper presents the Shannon information formula but does not prescribe a concrete estimator or smoothing strategy. In the absence of a prescribed estimator, we implemented a naïve frequency‐count MI with add-one smoothing to avoid zero‐counts, whereas the released Augusta code leverages sklearn.metrics.mutual_info_score. That library routine carries its internal conventions (no explicit smoothing by default, possible bias corrections for small samples). It thus can produce materially different MI values than a custom-written calculator. This dependency is not specified in the PDF and only surfaced upon examining the released implementation.
Finally, although the manuscript explains that edge direction and sign derive from comparing each gene’s most significant difference (MSD) time-point, it never states that gene-pairs whose MSDs coincide should be excluded outright. Our pipeline first computed MI for every pair, then used MSD ordering to assign directionality even when both genes peaked simultaneously. By contrast, the official Augusta code prefilters away any pair whose MSDs tie—thereby avoiding ambiguous or bidirectional assignments—and only evaluates MI on pairs with strictly distinct MSD positions. This optimization, which both speeds up computation and reduces spurious edges, is implemented in the released source code but is not described in the manuscript, consistent with common practice where performance-oriented heuristics are deferred to software implementations.
3.4 SERRF
A step beyond broadly applicable models, such as random forest, is to apply them in niche domains with highly specialized purposes. Usually, such approaches involve collaboration between a knowledgeable programmer and a domain expert. This includes methods such as systematic error removal using random forest (SERRF) (Fan et al. 2019), which utilizes random forests to remove unwanted variation from large metabolomics datasets. The challenge in coding such applications lies in efficiently understanding the experimental context and handling the metadata.
We first prompted Gemini (within the provided Colab notebooks) to implement SERRF based solely on an uploaded PDF file of the seminal paper and an Excel file of example experimental data, also included in the published version of the software (see Methods). The first prompt failed to produce working code due to various indexing and formatting errors, despite numerous debugging attempts made in an automated loop within the Gemini-produced notebook. This failure may in part be due to the complexity of the example data Excel sheet; when converted into Pandas, it has multiple column and row indices, including duplicate names, which quickly becomes complex to manage with insufficient experimental context. Had we provided an example datasheet more amenable to metadata organization packages, such as explicitly requesting the use of AnnData, along with more guided context regarding the experimental data, this may have increased the likelihood of success. However, the goal was to produce a zero-shot LLM-coded implementation on the fly, not an involved, expert-guided reimplementation. Instead, we attempted simple modifications to the original prompt.
In a new Collab notebook and Gemini session, we next tried to utilize a link to the GitHub repository for SERRF (written in R) to provide additional information for the LLM to retrieve and hopefully produce usable code in Python. The same prompt above was used, except a link to the repository was provided. Here, Gemini produced more structured code and efficient debugging to handle metadata and indexing. A SERRF function was created and successfully implemented. Plots of the batch-to-batch variation in PCA space were made in a similar style to the published software (Fig. 5). However, the PCA score plots of the example dataset before (Fig. 5A and C) and after (Fig. 5B and D) correction suggested that the Gemini implementation failed to remove any substantial variation after correction. Still, the pre-correction PCA plots (although rotated) appear similar to the published results. They are also annotated correctly, suggesting that Gemini succeeded at the common task of annotating and running PCA, but failed at the niche, domain-specific task of removing systematic errors from metabolomics data.
SERRF results. Plots of samples in PCA space using the SERRF example dataset before and after error correction, respectively, using the published SERFF R-Shiny App (A, B), the version from Gemini within Google Collab (C, D), and the version from GPT-o4-mini-high (E, F).
Lastly, we used gitingest.com to produce a text file of the SERRF GitHub repository and provided it in the same prompt used before. This prompt failed to deliver working code, despite providing more information in the form of a full-text file of the GitHub repository contents, which may suggest context rot (Context Rot: How Increasing Input Tokens Impacts LLM Performance 2025).
Interestingly, in all cases upon manual inspection, the logic of the Gemini-produced random forest model code (despite whether it produced an error), did follow the logic described in the SERRF paper, wherein a random forest regressor is trained on quality control samples across batches to predict systematic errors for each analyte in real samples. However, in the first case, where Gemini was provided only with a manuscript PDF and an example dataset file, the SERRF model corrected the samples by subtracting the random-forest-predicted systematic error from each sample value. In the second and third cases, where the repository link or contents were provided, the SERRF implementations utilized a correction factor multiplied by the sample values, which is more in line with the actual implementation described in the SERRF paper. This is perhaps due to the manuscript containing only a high-level description of the SERRF implementation, as opposed to the explicit definitions within the R code of the repository.
We next attempted nearly the same prompt with GPT-o4-mini-high. Upon examining the reasoning chain and errors, it became clear that the input data structure with a multi-index was the source of the problem. We then designed a second prompt to provide a more precise explanation of the input data structure (see Methods). Given this, the model was able to produce a working version of SERRF (Fig. 5E and F).
In summary, Gemini failed to produce a zero-shot implementation of SERRF, although the implemented code appeared to follow the general logic and sometimes ran successfully. It is possible that with more detailed prompting or example datasets, methods, or alternative agents, this approach could have succeeded. However, GPT-o4-mini-high was able to generate the code correctly on the second attempt. This required an understanding of the input data structure. This highlights that for this concept to work in the future, it generally requires greater detail regarding the expected data structures, particularly for situations where multi-index or complex data structures are required.
3.5 GSEA
The GSEApy paper (Fang et al. 2023) was provided to various models with a prompt requesting the method from scratch (see Methods). Notably, as an application note, this paper contains almost no detail about the GSEA method. Instead, it references the original paper, focusing on a high-level description of how they implemented the technique in Rust to gain speed. Of the tested methods, only GPT-o4-mini-high was able to generate a fully functional, from-scratch implementation of the preranked GSEA algorithm, returning enrichment scores comparable to those produced by GSEApy. Although we provided the GSEApy paper and not the original Subramanian et al. paper that contains the actual method details, GPT-o4-mini-high identified the original 2005 publication (Subramanian et al. 2005), recognized its relevance, and followed its methods in the implementation. Although the model stated that it relied only on standard libraries such as Pandas and NumPy, we cannot be certain whether it also drew from the GSEApy repository during algorithm reconstruction, given that it searched the web to find the original paper.
Both the GPT-based method and the GSEApy method were applied to the same ranked gene list derived from differential expression analysis of migratory versus non-migratory HeLa cells, using the KEGG_2021_Human library. Identical parameters were used: permutation_num = 1000, min_size = 15, max_size = 500, and seed = 42. Figure 6A shows the bar plot of the top 15 enriched KEGG pathways generated by the GPT-o4-mini-high, and Fig. 6B displays the same analysis performed using GSEApy, with pathways sorted by -log_10_(FDR q-value) in both plots. The set of top 15 enriched pathways was identical between the two methods, while the q-values differed slightly, resulting in slight differences in the ordering of pathways. Figure 6C presents a scatterplot comparing q-values for overlapping pathways between the GPT-o4-mini-high and GSEApy results. These findings suggest that LLMs can replicate the logic of sophisticated scientific tools, such as GSEA, and can identify relevant literature even without explicit prompting.
Comparison of GSEA results from GPT-o4-mini-high and GSEApy implementations. (A) Bar plot of the top 15 enriched KEGG pathways generated using the GPT-o4-mini-high from-scratch implementation. Pathways are sorted by -log10(FDR q-value), and the x-axis displays normalized enrichment scores (NES). (B) Bar plot showing top 15 pathways produced by GSEApy’s gp.prerank() function. (C) Scatterplot comparing FDR q-values for overlapping pathways between the GPT-o4-mini-high and GSEApy.
4 Discussion
A summary of the model successes across tasks is shown in Table 2. Our systematic evaluation demonstrates that modern LLMs, when guided exclusively by method descriptions in original publications, can effectively reimplement a variety of core computational algorithms with fidelity comparable to well-established software libraries. In all tasks, GPT-o4-mini-high was successful, making it the clear leader in these tasks. Overall, these results suggest that, for well-specified, mathematically grounded methods, LLMs are capable of “zero-shot” synthesis of robust implementations without human-written scaffolding. We note that our benchmark is intentionally limited in scope and is not intended as a comprehensive evaluation of all scientific algorithms, but rather as an initial demonstration of what is currently achievable.
However, our case studies highlight essential caveats. In the Augusta task, despite successfully inferring network topology on the first attempt, the initial LLM-generated code differed in discretization strategy, MI estimation, and edge-direction heuristics. These differences reflect reasonable implementation choices that are typically resolved at the software level rather than in manuscript text (Table 1). Only after retrieving the authors’ GitHub source could the model resolve these implementation-level subtleties and align its outputs with the behavior of the released package. Our case study on SERRF also revealed that details of the data structure are essential. This underscores key limitations of this concept: (i) narrative ambiguity in publications can lead to reproducible but unintended variations in implementation; (ii) integration of external code insights remains essential for resolving under-specified methodological details; and (iii) complex data structures must be specified.
Taken together, our findings suggest that literature-driven code generation may evolve into a useful complement to traditional human-maintained libraries. By treating articles as executable specifications, research teams could reduce the overhead of dependency management, bug triaging, and version conflicts, instead leveraging LLMs to generate bespoke implementations on demand. This paradigm aligns with broader trends in reproducible research, where code provenance and transparency are of paramount importance. It holds promise for democratizing access to cutting-edge methods without requiring extensive software engineering expertise. For example, newcomers to bioinformatics commonly choose between learning python or R based on the availability of open-source packages used in their subfields. As shown by the on-demand conversion of packages such as SERRF (only published in R) to workable Python code in minutes, researchers may simply generate implementations in their language of choice. We do not measure long-term operational maintenance costs; our claims are limited to short-term fidelity and feasibility rather than lifecycle maintenance.
On-demand literature-to-code generation also lessens the burden on developers and expert contributors to provide and maintain manual implementations of packages in R and Python. A representative example is a feature request on a niche bioinformatics algorithm package hosted on a GitHub repository; the package was written to produce static plots automatically, but perhaps the user wishes to add custom labels, change colors, and increase the size of tick labels. Rather than submitting an issue or feature request and hoping the package maintainer responds and dedicates time to implement a fix, what if the repository included a tried-and-tested LLM prompt that was known to reproduce the package? The latter would rely on sufficient evaluation metrics; see below. The user augments said optimized prompt in plain text with “…and be sure to allow the user to dictate how to label and color the plots,” or similar, when generating the on-demand code.
This idea can dovetail with current practices. We urge interested practitioners to include a “Method Specification Prompt” or similar section in their manuscripts, detailing the explicit prompt and LLM that the authors have successfully used to reproduce their method. Even better, the prompts would be included with the packages in code repositories as simple markdown files that can be updated and maintained alongside (and eventually, in place of) the raw code (The New Code–Sean Grove, OpenAI 2025). This practice will not only provide benefit for the informal exploratory users, but the authors themselves; the process of developing a comprehensive prompt is likely to uncover potential bugs, edge cases, and perhaps new ideas regarding the algorithmic implementations, as we found in our results above.
We view the Method Specification Prompt (MSP) as a provisional mechanism rather than a substitute for conventional code release. Its primary value lies in revealing underspecified portions of a methods section and providing a reproducible recipe for LLM-based reimplementation, but it does not eliminate maintenance. In practice, MSP upkeep may shift maintenance efforts from code to prompt structure, schema details, and model compatibility. To make this concrete, an MSP can be distilled into three elements: (i) required inputs and data structures, (ii) the minimal algorithmic steps as expressed in the manuscript, and (iii) evaluation criteria that confirm correct behavior. We include it as an exploratory concept, not a standardized framework.
A potential argument against this practice is reproducibility. Subverting packages would ostensibly only worsen the reproducibility crisis. However, we ascertain that the opposite may be the case, and this practice may enhance reproducibility. If an LLM cannot re-implement a workable version of an algorithm described in a manuscript, how likely is a human reader to be able to do so? Thus, this acts as an immediate litmus test on whether the Methods section of a manuscript is underspecified. The alternative is to continue with the status quo, which often focuses on code repeatability (ie, can a user produce the same result as the authors using the same provided code, package versions, environments, operating system details, and fixed random seeds), as opposed to code reproducibility (producing a working implementation based on a published description)1. As the barrier to delivering working code dissipates, so should the barrier to producing working re-implementations. If developers truly want their code (or underlying idea) to be used correctly and in perpetuity, while simultaneously lowering the burden of maintenance on themselves, we believe that providing a Method Specification Prompt is a potentially beneficial and low-cost endeavor. Perhaps eventually GitHub or similar repositories will shift towards housing specification markdown files, rather than raw code.
Nevertheless, we emphasize that LLM-driven generation is not yet a substitute for rigorous software validation practices. Hallucinations or mistakes still happen. Automated implementations must be subjected to comprehensive testing, code review, and continuous integration workflows to guard against subtle errors, especially in high-stakes domains such as clinical decision support or large-scale omics analyses. Furthermore, prompt engineering and RAG system design will critically influence outcomes; best practices for encoding methodological context, handling edge-case failures, and iterating on generated code require further study. Many of these use cases would need to rely upon evaluation metrics and benchmarks that can robustly ensure the implementation is correctly performing. Such information would be critical to include in the Method Specific Prompt. As this practice evolves, so will the practice of what (and what not) to specify in the prompts (Yang et al. 2025). At present, utilizing this practice for low-stakes, exploratory data analyses can still speed the pace of research while the community experiments with how to best adapt to the era of LLM code generation.
Our benchmark, although diverse in algorithmic complexity, ranging from tree-based learners to term enrichment analysis and network construction based on mutual information, remains limited in scope. Future work should expand evaluations to include stochastic optimization routines, deep-learning architectures, and multi-stage computational pipelines. Additionally, exploring multilingual publications, cross-language code synthesis, and embedding LLM outputs within automated deployment frameworks (e.g. Docker, CI/CD) will be essential for assessing real-world applicability. Finally, although prompts instructed models to ‘forget’ prior knowledge, we cannot verify that they did not refer to the training data or external tools. This is an inherent limitation when using proprietary LLMs and affects both the paper-as-specification idea and the MSP framing. Our results should be interpreted with this residual uncertainty in mind.
In summary, when authors meticulously document every algorithmic detail in their manuscripts, today’s LLMs are already capable of translating those descriptions into working code on demand—providing a potential alternative pathway alongside conventional software libraries and possibly reducing maintenance overhead in some use cases. Further work should investigate whether this approach can be generalized to other programming languages, which would enable it to break down barriers to language-specific implementations and empower researchers to integrate algorithms seamlessly into diverse, complex computational workflows. As publication standards shift toward more structured algorithmic reporting, the divide between method description and on-demand LLM implementation will all but disappear, potentially contributing to more reproducible and broadly accessible computational science.
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1Behdenna A , Colange M, Haziza J et al py Com Bat, a Python tool for batch effects correction in high-throughput molecular data using empirical Bayes methods. BMC Bioinformatics 2023;24:459. 10.1186/s 12859-023-05578-538057718 PMC 10701943 · doi ↗ · pubmed ↗
- 2Breiman L. Random forests. Mach Learn 2001;45:5–32.
- 3Chen M , Tworek J, Jun H et al Evaluating large language models trained on code. 10.48550/ar Xiv.2107.03374, 2021, preprint: not peer reviewed. · doi ↗
- 4Context Rot: How Increasing Input Tokens Impacts LLM Performance. https://research.trychroma.com/context-rot (24 July 2025, date last accessed).
- 5Fan S , Kind T, Cajka T et al Systematic error removal using random Forest for normalizing Large-Scale untargeted lipidomics data. Anal Chem 2019;91:3590–6.30758187 10.1021/acs.analchem.8b 05592 PMC 9652764 · doi ↗ · pubmed ↗
- 6Fang Z , Liu X, Peltz G. GSE Apy: a comprehensive package for performing gene set enrichment analysis in Python. Bioinformatics 2023;39:btac 757.36426870 10.1093/bioinformatics/btac 757PMC 9805564 · doi ↗ · pubmed ↗
- 7Johnson WE , Li C, Rabinovic A. Adjusting batch effects in microarray expression data using empirical Bayes methods. Biostatistics 2007;8:118–27.16632515 10.1093/biostatistics/kxj 037 · doi ↗ · pubmed ↗
- 8Lewis P , Perez E, Piktus A et al Retrieval-augmented generation for knowledge-intensive NLP tasks. 10.48550/ar Xiv.2005.11401, 2021, preprint: not peer reviewed. · doi ↗
