Use of a Structured Knowledge Base Enhances Metadata Curation by Large Language Models
Sowmya S. Sundaram, Benjamin Solomon, Avani Khatri, Anisha Laumas, Purvesh Khatri, Mark A. Musen

TL;DR
This study tests whether GPT-4 can improve datasets' adherence to metadata standards. Prompted with domain information from a structured knowledge base (textual descriptions of CEDAR templates), GPT-4 raised adherence from 79% to 97%, which supports its use in automated metadata curation.
Contribution
The paper demonstrates that pairing GPT-4 with a structured knowledge base significantly improves adherence to metadata standards, and that the gain depends on supplying domain-specific information, advancing automated curation methods.
Findings
GPT-4 alone marginally improved adherence from 79% to 80%.
Providing domain information (textual descriptions of CEDAR templates) increased adherence from 79% to 97% (p<0.01).
Structured knowledge bases enhance LLMs' metadata curation capabilities.
Abstract
Metadata play a crucial role in ensuring the findability, accessibility, interoperability, and reusability of datasets. This paper investigates the potential of large language models (LLMs), specifically GPT-4, to improve adherence to metadata standards. We conducted experiments on 200 random data records describing human samples relating to lung cancer from the NCBI BioSample repository, evaluating GPT-4's ability to suggest edits for adherence to metadata standards. We computed the adherence accuracy of field name-field value pairs through a peer review process, and we observed a marginal average improvement in adherence to the standard data dictionary from 79% to 80% (p<0.5). We then prompted GPT-4 with domain information in the form of the textual descriptions of CEDAR templates and recorded a significant improvement to 97% from 79% (p<0.01). These results indicate that, while LLMs…
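The adherence accuracy described in the abstract is a fraction: the share of field name-field value pairs in a record that conform to the standard. The paper obtained these judgments through peer review; the sketch below swaps in a simple automated check to make the arithmetic concrete. The `FieldSpec` class, the `SPECS` template, and the `RECORD` example are all hypothetical and are not taken from the paper, CEDAR, or BioSample.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class FieldSpec:
    """Hypothetical description of one field in a metadata template."""
    name: str
    allowed_values: frozenset[str]  # empty set: any non-empty value is accepted


def pair_adheres(name: str, value: str, specs: dict[str, FieldSpec]) -> bool:
    """Return True if the field name is known and its value is permitted."""
    spec = specs.get(name)
    if spec is None:
        return False  # field name is not part of the template
    cleaned = value.strip().lower()
    if not cleaned:
        return False  # empty values never adhere
    return not spec.allowed_values or cleaned in spec.allowed_values


def adherence_accuracy(record: dict[str, str], specs: dict[str, FieldSpec]) -> float:
    """Fraction of field name-field value pairs in a record that adhere."""
    if not record:
        return 0.0
    adhering = sum(pair_adheres(n, v, specs) for n, v in record.items())
    return adhering / len(record)


# Illustrative template and record (assumptions, not data from the paper).
SPECS = {
    "tissue": FieldSpec("tissue", frozenset({"lung"})),
    "sex": FieldSpec("sex", frozenset({"male", "female"})),
    "disease": FieldSpec("disease", frozenset()),
}
RECORD = {"tissue": "Lung", "sex": "M", "disease": "lung cancer", "smoker": "yes"}

# "tissue" and "disease" adhere; "sex" has a non-standard value and
# "smoker" is not a template field, so two of four pairs adhere.
print(adherence_accuracy(RECORD, SPECS))
```

Averaging this per-record score over all records, before and after the model's suggested edits, would give figures comparable in form to the 79%, 80%, and 97% reported above, although the paper's own numbers rest on human review and not on a check like this one.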
Taxonomy
Topics: Semantic Web and Ontologies
Methods: Attention Is All You Need · Linear Layer · Layer Normalization · Dense Connections · Label Smoothing · Residual Connection · Multi-Head Attention · Adam · Dropout · Softmax
