Automated Standardization of Legacy Biomedical Metadata Using an Ontology-Constrained LLM Agent
Josef Hardi, Martin J. O'Connor, Marcos Martinez-Romero, Jean G. Rosario, Stephen A. Fisher, Mark A. Musen

TL;DR
This paper introduces an LLM-based metadata standardization system that queries authoritative biomedical terminology services in real time, and reports better accuracy than approaches that rely on the model's training knowledge alone.
Contribution
It combines an LLM with real-time queries to authoritative terminology services, aiming at scalable standardization of biomedical metadata.
Findings
Augmenting LLMs with real-time ontology queries improves accuracy.
The system outperforms LLM-only approaches on legacy biomedical metadata.
Evaluation on 839 records shows practical scalability and effectiveness.
Abstract
Scientific metadata are often incomplete and noncompliant with community standards, limiting dataset findability, interoperability, and reuse. When reporting guidelines exist, they typically lack machine-actionable representations. Producing FAIR datasets requires encoding metadata standards as machine-actionable templates with rich field specifications and precise value constraints. Recent work has shown that LLMs guided by field names and ontology constraints can improve metadata standardization, but these approaches treat constraints as static text prompts, relying on the model's training knowledge alone. We present an LLM-based metadata standardization system that queries authoritative biomedical terminology services in real time to retrieve canonically correct vocabulary terms on demand. We evaluate this approach on 839 legacy metadata records from the Human BioMolecular Atlas…
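
The lookup-then-select idea described in the abstract can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: `Term`, `toy_lookup`, `standardize`, the in-memory vocabulary, and the `example.org` identifiers are hypothetical and do not come from the paper. The terminology service is replaced by a local dictionary, and the LLM's role of choosing among candidates is replaced by taking the first match.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(frozen=True)
class Term:
    """A canonical vocabulary term (hypothetical structure)."""
    label: str
    iri: str


# Hypothetical stand-in for a terminology service: each canonical term is
# paired with the set of names under which it may be found.
_TOY_VOCABULARY = [
    (Term("kidney", "https://example.org/term/kidney"), {"kidney", "renal"}),
    (Term("heart", "https://example.org/term/heart"), {"heart", "cardiac"}),
]


def toy_lookup(query: str) -> List[Term]:
    """Return the canonical terms whose known names match the query."""
    key = query.strip().lower()
    return [term for term, names in _TOY_VOCABULARY if key in names]


def standardize(raw_value: str,
                lookup: Callable[[str], List[Term]]) -> Optional[Term]:
    """Map a legacy value to a canonical term retrieved at call time.

    The value is resolved through `lookup` instead of being guessed, so the
    result is always a term the service actually returned. None signals that
    no canonical term was found and the value needs review.
    """
    candidates = lookup(raw_value)
    # Assumption: the first candidate is taken; a real agent would let the
    # model choose among candidates using the field's context.
    return candidates[0] if candidates else None


if __name__ == "__main__":
    for value in ["  Kidney ", "cardiac", "spleen?"]:
        print(repr(value), "->", standardize(value, toy_lookup))
```

Passing the lookup in as a function keeps the selection logic separate from the vocabulary source, so the in-memory stub could be swapped for a call to a live service without changing `standardize`.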
