# From articles to code: on-demand generation of core algorithms from scientific publications

**Authors:** Cameron S Movassaghi, Amanda Momenzadeh, Jesse G Meyer

PMC · DOI: 10.1093/bioinformatics/btag015 · Bioinformatics · 2026-01-11

## TL;DR

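This paper tests whether large language models (LLMs) can implement core algorithms using only the original scientific publication as input. The generated code frequently matches established libraries, and failures trace mainly to manuscripts that underspecify implementation details or data structures.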

## Contribution

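The study systematically evaluates state-of-the-art LLMs on a benchmark that includes random forests, batch correction methods, gene regulatory network inference, and gene set enrichment analysis, with each model given only the original publication as its specification.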

## Key findings

- LLMs can frequently reproduce package-level functionality from papers, with performance indistinguishable from established libraries.
- Failures and discrepancies arose primarily when manuscripts underspecified implementation details or data structures, rather than from limitations in model reasoning.
- The results highlight where publication standards currently hinder reproducibility in computational methods.
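The first finding depends on checking that a generated implementation agrees with an established library on the same inputs. The sketch below shows one way such a check could look. It is not the authors' evaluation script: `outputs_match` and `generated_variance` are hypothetical names introduced here, and the standard library's `statistics.variance` merely stands in for an "established library".

```python
import math
import statistics
from typing import Callable, List, Sequence, Tuple

Func = Callable[[Sequence[float]], float]


def outputs_match(
    candidate: Func,
    reference: Func,
    inputs: Sequence[Sequence[float]],
    rel_tol: float = 1e-9,
    abs_tol: float = 1e-12,
) -> Tuple[bool, List[Tuple[int, float, float]]]:
    """Run both implementations on every input and collect disagreements.

    Returns (all_match, mismatches), where each mismatch is
    (input_index, candidate_output, reference_output).
    """
    mismatches = []
    for i, x in enumerate(inputs):
        got = candidate(x)
        want = reference(x)
        if not math.isclose(got, want, rel_tol=rel_tol, abs_tol=abs_tol):
            mismatches.append((i, got, want))
    return (not mismatches, mismatches)


def generated_variance(xs: Sequence[float]) -> float:
    """Hypothetical 'generated' code: sample variance written from the formula."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)


# Illustrative inputs only; these are not benchmark data from the paper.
datasets = [[1.0, 2.0, 3.0, 4.0], [2.5, 2.5, 2.5], [0.1, 0.2, 0.4, 0.8, 1.6]]

ok, diffs = outputs_match(generated_variance, statistics.variance, datasets)

# A deliberately wrong candidate (population variance) should be flagged.
bad_ok, bad_diffs = outputs_match(statistics.pvariance, statistics.variance, datasets)
```

Comparing with a tolerance rather than exact equality matters because two correct implementations can differ in floating-point rounding. For stochastic methods such as random forests, a comparison of summary metrics across seeds would be more appropriate than an element-wise check.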

## Abstract

Scientific software packages impose persistent maintenance costs due to dependency churn, version incompatibilities, and bug triage, even when the underlying algorithms are stable and well described. At the same time, peer-reviewed publications already function as the canonical record of many computational methods, yet translating narrative method descriptions into usable code remains labor-intensive and error-prone. Recent advances in large language models (LLMs) raise the question of whether published articles alone can serve as sufficient specifications for on-demand code generation, potentially reducing reliance on continuously maintained libraries.

We systematically evaluated state-of-the-art LLMs by tasking them with implementing core algorithms using only the original scientific publications as input. Across a diverse benchmark including random forests, batch correction methods, gene regulatory network inference, and gene set enrichment analysis, we show that modern LLMs can frequently reproduce package-level functionality with performance indistinguishable from established libraries. Failures and discrepancies primarily arose when manuscripts underspecified implementation details or data structures, rather than from limitations in model reasoning. These results demonstrate that literature-driven code generation is already feasible for many well-specified algorithms, while also exposing where current publication standards hinder reproducibility.

All prompts, generated code, evaluation scripts, and benchmark datasets are publicly available at https://github.com/xomicsdatascience/articles-to-code.
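To make concrete what "implementing a core algorithm from a publication" can involve, the sketch below computes the running-sum enrichment score used in gene set enrichment analysis, one of the benchmark algorithms named in the abstract. It follows the commonly cited weighted Kolmogorov-Smirnov-style definition and is an illustration only: it is not code generated in the study, the function name `enrichment_score` and its parameters are assumptions made here, and significance testing by permutation is left out.

```python
from typing import Sequence, Set


def enrichment_score(
    ranked_genes: Sequence[str],
    scores: Sequence[float],
    gene_set: Set[str],
    p: float = 1.0,
) -> float:
    """Return the signed maximum deviation of the running sum from zero.

    ranked_genes and scores must already be sorted by score (descending).
    Hits step up by |score|**p normalized over all hits; misses step down
    by 1 / (number of genes outside the set).
    """
    if len(ranked_genes) != len(scores):
        raise ValueError("ranked_genes and scores must have equal length")

    hits = [g in gene_set for g in ranked_genes]
    n_total = len(ranked_genes)
    n_hits = sum(hits)
    if n_hits == 0 or n_hits == n_total:
        raise ValueError("gene set must contain some, but not all, ranked genes")

    hit_norm = sum(abs(s) ** p for s, h in zip(scores, hits) if h)
    if hit_norm == 0:
        raise ValueError("hit weights sum to zero")
    miss_step = 1.0 / (n_total - n_hits)

    running = 0.0
    best = 0.0
    for s, h in zip(scores, hits):
        running += abs(s) ** p / hit_norm if h else -miss_step
        if abs(running) > abs(best):
            best = running
    return best


# Toy ranking, invented for illustration.
genes = ["A", "B", "C", "D"]
stats = [4.0, 3.0, 2.0, 1.0]

es_top = enrichment_score(genes, stats, {"A", "B"})      # set at the top of the list
es_bottom = enrichment_score(genes, stats, {"C", "D"})   # set at the bottom
```

Even in this small function, several choices (the weighting exponent `p`, how ties and empty sets are handled, whether the input is pre-sorted) must be fixed before the code can run, which is the kind of detail the abstract identifies as a source of discrepancies when a manuscript leaves it unspecified.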

## Full-text entities

- **Diseases:** diabetes (MESH:D003920)
- **Species:** Homo sapiens (human, species) [taxon 9606]
- **Cell lines:** HeLa — Homo sapiens (Human), Human papillomavirus-related endocervical adenocarcinoma, Cancer cell line (CVCL_0030)

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12857569/full.md

## Figures

6 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12857569/full.md

## References

20 references — full list in the complete paper: https://tomesphere.com/paper/PMC12857569/full.md

---
Source: https://tomesphere.com/paper/PMC12857569