# Identification of biomedical entities from multiple repositories using a specialized metadata schema and search-augmented large language models

**Authors:** Klaus Kaier, Felix Engel, Gita Benadi, Claudia Giuliani, Manuel Watter, Aref Kalantari, Karin Schuller, Claus-Werner Franzke, Markus Sperandio, Harald Binder

PMC · DOI: 10.1186/s13104-026-07632-w · BMC Research Notes · 2026-01-12

## TL;DR

This paper introduces a method using metadata schemas and large language models to accurately identify and annotate datasets from biomedical articles across multiple repositories.

## Contribution

A novel three-step, search-augmented prompting strategy for multi-repository dataset annotation using LLMs.

## Key findings

- Gemini 2.5 Pro outperformed GPT-4.1 and Claude Sonnet 4 in dataset annotation precision.
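- Repository-grounded extraction achieved higher precision than extraction that additionally used information from the article.
- The method reliably detected the datasets referenced in articles and produced schema-compliant annotations from repository landing page information.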

## Abstract

Many biomedical articles reference multiple datasets across different public repositories, complicating accurate metadata capture and downstream re-use. Building on our prior grounded large language model (LLM) workflows for biomedical entity annotation, we extend the approach to identify and annotate all datasets referenced by a paper, even when distributed across repositories, by combining a specialized metadata schema with a three-step, search-augmented prompting strategy.
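To make the multi-repository problem concrete, the sketch below groups dataset accessions found in a passage of text by repository. It is not the authors' method: the paper uses search-augmented LLM prompting, whereas this is a plain regular-expression pass, swapped in only to illustrate the task. The function name `find_accessions`, the `ACCESSION_PATTERNS` table, and the choice of the two repositories are assumptions made for illustration.

```python
import re

# Illustrative accession patterns (an assumption, not taken from the paper).
# GEO series accessions start with "GSE"; ProteomeXchange uses "PXD" plus six digits.
ACCESSION_PATTERNS = {
    "GEO": re.compile(r"\bGSE\d+\b"),
    "ProteomeXchange": re.compile(r"\bPXD\d{6}\b"),
}


def find_accessions(text: str) -> dict[str, list[str]]:
    """Return accessions found in `text`, grouped by repository.

    Duplicates are dropped while keeping first-seen order, and
    repositories without any hit are left out of the result.
    """
    found: dict[str, list[str]] = {}
    for repository, pattern in ACCESSION_PATTERNS.items():
        hits = list(dict.fromkeys(pattern.findall(text)))
        if hits:
            found[repository] = hits
    return found


if __name__ == "__main__":
    # Hypothetical data availability statement with made-up accessions.
    statement = (
        "RNA-seq data are available under GSE123456 (see also GSE123456); "
        "proteomics data were deposited under PXD012345."
    )
    print(find_accessions(statement))
```

A pattern pass like this only finds accessions that are spelled out in the text, which is one reason a search-augmented approach that consults the repository landing pages can capture more than the article alone.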

In the Transregional Collaborative Research Center PILOT (TRR 359 “Perinatal Development of Immune Cell Topology”), Gene Expression Omnibus (GEO) releases are common alongside additional repository deposits. The applied approach reliably detected datasets referenced in articles and produced schema-compliant annotations using information available on the repository landing pages. After validation through structured face-to-face interviews with the article’s senior author, Gemini 2.5 Pro achieved higher precision (97.1%) than GPT-4.1 (81.9%, p < 0.001) and Claude Sonnet 4 (88.6%, p < 0.001). Limiting the annotation to the information available in the repositories achieved higher precision than adding information from the article (91.9% vs. 88.3% across all LLMs, p = 0.004). These results indicate that simple repository-grounded extraction enables high-quality, multi-dataset metadata annotation, which has the potential to minimize the time and effort required for manual metadata annotation.
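The precision figures above can be read as the share of produced annotations that the validating author confirmed as correct. The sketch below shows that calculation and one common way to compare two such proportions. The abstract does not state which statistical test the authors used, so the pooled two-proportion z-test here is an assumption, and the counts in the example are hypothetical placeholders rather than data from the study.

```python
import math


def precision(correct: int, total: int) -> float:
    """Fraction of produced annotations that were confirmed as correct."""
    if total <= 0:
        raise ValueError("total must be positive")
    return correct / total


def two_proportion_z_test(c1: int, n1: int, c2: int, n2: int) -> tuple[float, float]:
    """Pooled two-proportion z-test; returns (z, two-sided p-value).

    This is an assumed choice of test, not the one reported in the paper.
    """
    p1, p2 = precision(c1, n1), precision(c2, n2)
    pooled = (c1 + c2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    if se == 0:
        raise ValueError("test is undefined when the pooled proportion is 0 or 1")
    z = (p1 - p2) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


if __name__ == "__main__":
    # Hypothetical counts for two models; NOT the study's data.
    z, p = two_proportion_z_test(97, 100, 82, 100)
    print(f"z = {z:.2f}, p = {p:.4f}")
```

Because annotations from different models on the same datasets are not independent, a paired or mixed-model analysis may be more appropriate in practice; the unpaired test is used here only to keep the sketch short.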

The online version contains supplementary material available at 10.1186/s13104-026-07632-w.

## Full-text entities

- **Diseases:** CRC (MESH:D015179), LLM (MESH:D007806), hallucinations (MESH:D006212)
- **Chemicals:** Claude Sonnet 4 (-), Pro (MESH:D011392)
- **Species:** Homo sapiens (human, species) [taxon 9606], Mus musculus (house mouse, species) [taxon 10090]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12837611/full.md

## Figures

1 figure with captions in the complete paper: https://tomesphere.com/paper/PMC12837611/full.md

## References

1 reference — full list in the complete paper: https://tomesphere.com/paper/PMC12837611/full.md

---
Source: https://tomesphere.com/paper/PMC12837611