Automated Standardization of Legacy Biomedical Metadata Using an Ontology-Constrained LLM Agent

Josef Hardi; Martin J. O'Connor; Marcos Martinez-Romero; Jean G. Rosario; Stephen A. Fisher; Mark A. Musen

arXiv:2604.08552 · cs.DB · April 13, 2026

TL;DR

This paper introduces an LLM-based system that enhances biomedical metadata standardization by integrating real-time ontology queries, significantly improving accuracy over models relying solely on training data.

Contribution

It presents a novel approach combining LLMs with real-time authoritative terminology services for scalable biomedical metadata standardization.

Findings

01. Augmenting LLMs with real-time ontology queries improves accuracy.

02. The system outperforms LLM-only approaches on legacy biomedical metadata.

03. Evaluation on 839 records shows practical scalability and effectiveness.

Abstract

Scientific metadata are often incomplete and noncompliant with community standards, limiting dataset findability, interoperability, and reuse. When reporting guidelines exist, they typically lack machine-actionable representations. Producing FAIR datasets requires encoding metadata standards as machine-actionable templates with rich field specifications and precise value constraints. Recent work has shown that LLMs guided by field names and ontology constraints can improve metadata standardization, but these approaches treat constraints as static text prompts, relying on the model's training knowledge alone. We present an LLM-based metadata standardization system that queries authoritative biomedical terminology services in real time to retrieve canonically correct vocabulary terms on demand. We evaluate this approach on 839 legacy metadata records from the Human BioMolecular Atlas…
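The lookup-instead-of-recall idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the LLM step is omitted, and the terminology service is replaced by a hypothetical in-memory table (`mock_lookup`) with placeholder identifiers. The names `Term`, `mock_lookup`, and `standardize_field` are all assumptions introduced for illustration. The sketch only shows a legacy value being replaced by a term returned from a lookup, and left untouched when no authoritative match exists.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Term:
    label: str  # canonical label from the vocabulary
    iri: str    # term identifier (placeholder values below, not real IRIs)


# Hypothetical stand-in for an authoritative terminology service.
# A real system would issue a live query here; this table is illustrative only.
_MOCK_VOCABULARY = {
    "kidney": Term("kidney", "http://example.org/term/0001"),
    "heart": Term("heart", "http://example.org/term/0002"),
}
_MOCK_SYNONYMS = {"renal tissue": "kidney", "cardiac": "heart"}


def mock_lookup(query: str) -> Optional[Term]:
    """Return the canonical term for a query, or None if nothing matches."""
    key = query.strip().lower()
    key = _MOCK_SYNONYMS.get(key, key)  # map a known synonym to its canonical key
    return _MOCK_VOCABULARY.get(key)


def standardize_field(
    raw_value: str, lookup: Callable[[str], Optional[Term]]
) -> dict:
    """Replace a legacy value with a looked-up term; never invent one."""
    term = lookup(raw_value)
    if term is None:
        # No authoritative match: keep the original value and flag it.
        return {"value": raw_value, "iri": None, "standardized": False}
    return {"value": term.label, "iri": term.iri, "standardized": True}


if __name__ == "__main__":
    for legacy in ["Renal Tissue", "  HEART ", "unknown organ"]:
        print(standardize_field(legacy, mock_lookup))
```

Passing the lookup as a parameter keeps the standardization step independent of any particular service, so the mock could be swapped for a real client without changing `standardize_field`.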
