# Evaluation of search-enabled pretrained Large Language Models on retrieval tasks for the PubChem database

**Authors:** Ash Sze, Soha Hassoun

PMC · DOI: 10.1093/bioadv/vbaf064 · Bioinformatics Advances · 2025-03-24

## TL;DR

This paper evaluates how ChatGPT-4o, a search-enabled pretrained LLM, retrieves data from the PubChem database, and finds that instructing it to generate programmatic access is more likely to yield correct answers.

## Contribution

The study introduces a methodology for adapting previously documented PubChem access protocols into prompts for a search-enabled LLM, and evaluates it on eight such protocols.

## Key findings

- Instructing ChatGPT-4o to generate programmatic access yields correct answers more frequently.
- A methodology was developed to adapt existing PubChem protocols into LLM prompts.
- Iterative prompt refinement enhances the LLM's ability to retrieve accurate data.
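
To make "programmatic access" concrete, the following is a minimal sketch of the kind of code an LLM might be asked to generate: a lookup of compound properties by name through PubChem's PUG REST interface. It is an illustration only, not code from the paper. The helper names (`build_property_url`, `parse_properties`, `fetch_properties`) are hypothetical, and the choice of properties is an assumption.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

# Base address of PubChem's PUG REST service.
PUG_REST_BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def build_property_url(name, properties):
    """Build a PUG REST URL that looks up a compound by name
    and requests the given properties as JSON."""
    prop_list = ",".join(properties)
    # quote() percent-encodes spaces and other unsafe characters in the name.
    return f"{PUG_REST_BASE}/compound/name/{quote(name)}/property/{prop_list}/JSON"


def parse_properties(payload):
    """Return the list of property records from a decoded PUG REST
    property response (a dict with a top-level 'PropertyTable' key)."""
    return payload["PropertyTable"]["Properties"]


def fetch_properties(name, properties, timeout=30):
    """Query PubChem over the network and return the parsed records.
    Hypothetical helper: requires internet access and does no retrying."""
    url = build_property_url(name, properties)
    with urlopen(url, timeout=timeout) as resp:
        return parse_properties(json.load(resp))
```

Calling `fetch_properties("losartan", ["MolecularFormula", "MolecularWeight"])` would issue one HTTP request. Keeping URL construction and response parsing in separate functions makes generated code of this kind easier to inspect before it is run.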

## Abstract

Databases are indispensable in biological and biomedical research, hosting vast amounts of structured and unstructured data and facilitating the organization, retrieval, and analysis of complex data. Database access, however, remains a manual, tedious, and sometimes overwhelming task. The availability of Large Language Models (LLMs) has the potential to play a transformative role in accessing databases.

In this study, we investigate the current state of using a pretrained, search-enabled LLM (ChatGPT-4o) for data retrieval from PubChem, a flagship database that plays a critical role in biological and biomedical research. We evaluate eight previously documented PubChem access protocols. We develop a methodology for adapting the protocols into an LLM prompt, supplementing the prompt with additional context through iterative prompt refinement as needed. To further evaluate the LLM's capabilities, we instruct the LLM to perform the retrieval. We show quantitatively and qualitatively that instructing ChatGPT-4o to generate programmatic access is more likely to yield the correct answers. We also outline future directions for developing LLMs for database access.

All text used to prompt ChatGPT-4o is provided in the manuscript.
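
The iterative prompt refinement mentioned in the abstract can be pictured as a simple loop: start from the protocol text, query the model, and add one piece of context at a time until the answer checks out. The sketch below is an assumption about how such a loop might look, not the authors' procedure. `build_prompt`, `refine_prompt`, `ask_llm`, and `is_correct` are all hypothetical names, and the model call is supplied by the caller.

```python
from typing import Callable, List, Optional, Tuple


def build_prompt(task: str, context: List[str]) -> str:
    """Combine the protocol text with any context supplements added so far."""
    if not context:
        return task
    notes = "\n".join(f"- {c}" for c in context)
    return f"{task}\n\nAdditional context:\n{notes}"


def refine_prompt(
    task: str,
    supplements: List[str],
    ask_llm: Callable[[str], str],      # hypothetical: sends a prompt, returns the answer
    is_correct: Callable[[str], bool],  # hypothetical: checks the answer against a reference
) -> Tuple[Optional[str], int]:
    """Try the bare task first, then add one supplement per round.

    Returns (prompt that produced a correct answer, number of supplements used),
    or (None, len(supplements)) if no round succeeded.
    """
    context: List[str] = []
    for round_no in range(len(supplements) + 1):
        prompt = build_prompt(task, context)
        if is_correct(ask_llm(prompt)):
            return prompt, round_no
        if round_no < len(supplements):
            # Add the next piece of context before retrying.
            context.append(supplements[round_no])
    return None, len(supplements)
```

Returning the number of supplements used gives a rough measure of how much extra context a given protocol needed.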

## Full-text entities

- **Genes:** CYP2C8 (cytochrome P450 family 2 subfamily C member 8) [NCBI Gene 1558] {aka CPC8, CYP2C8DM, CYPIIC8, MP-12/MP-20}, B3GAT2 (beta-1,3-glucuronyltransferase 2) [NCBI Gene 135152] {aka GLCATS}, CYP2C19 (cytochrome P450 family 2 subfamily C member 19) [NCBI Gene 1557] {aka CPCJ, CYP2C, CYPIIC17, CYPIIC19, P450C2C, P450IIC19}
- **Chemicals:** H (MESH:D006859), I (MESH:D007455), F (MESH:D005461), Olmesartan (MESH:C437965), Irbesartan (MESH:D000077405), C (MESH:D002244), octanol (MESH:D000442), Si (MESH:D012825), Valsartan (MESH:D000068756), S (MESH:D013455), Br (MESH:D001966), water (MESH:D014867), losartan (MESH:D019808), P (MESH:D010758), O (MESH:D010100), N (MESH:D009584), CID (-), silver (MESH:D012834), Candesartan (MESH:C081643), Cl (MESH:D002713), gold (MESH:D006046)
- **Species:** Rattus norvegicus (brown rat, species) [taxon 10116], Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12073969/full.md

## Figures

4 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12073969/full.md

## References

18 references — full list in the complete paper: https://tomesphere.com/paper/PMC12073969/full.md

---
Source: https://tomesphere.com/paper/PMC12073969