Predicting Microbial Ontology and Pathogen Risk from Environmental Metadata with Large Language Models
Hyunwoo Yoo, Gail L. Rosen

TL;DR
This paper demonstrates that large language models can effectively classify microbial ontology categories and predict pathogen risk from environmental metadata alone, outperforming traditional models especially in small-sample and heterogeneous settings.
Contribution
It introduces the use of large language models for microbiome classification and pathogen risk prediction using only metadata, showing superior performance over traditional methods.
Findings
LLMs outperform traditional models in ontology classification.
LLMs demonstrate strong predictive ability for pathogen contamination.
Models generalize across different sites and metadata distributions.
Abstract
Traditional machine learning models struggle to generalize in microbiome studies where only metadata is available, especially in small-sample settings or across studies with heterogeneous label formats. In this work, we explore the use of large language models (LLMs) to classify microbial samples into ontology categories such as EMPO 3 and related biological labels, as well as to predict pathogen contamination risk, specifically the presence of E. coli, using environmental metadata alone. We evaluate LLMs such as ChatGPT-4o, Claude 3.7 Sonnet, Grok-3, and LLaMA 4 in zero-shot and few-shot settings, comparing their performance against traditional models like Random Forests across multiple real-world datasets. Our results show that LLMs not only outperform baselines in ontology classification, but also demonstrate strong predictive ability for contamination risk, generalizing across sites…
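To make the zero-shot versus few-shot distinction concrete, the following is a minimal sketch of how a metadata record might be serialized into a classification prompt. It is not the authors' prompt format, which this summary does not give. The function names, the metadata fields (env_material, depth_m, site), and the short label list are all hypothetical and chosen only for illustration. No model is called; the sketch stops at prompt construction.

```python
from typing import Mapping, Sequence, Tuple

Metadata = Mapping[str, object]


def format_sample(metadata: Metadata) -> str:
    """Serialize one sample's metadata as 'key: value' pairs in sorted key order."""
    return "; ".join(f"{key}: {metadata[key]}" for key in sorted(metadata))


def build_prompt(
    query: Metadata,
    labels: Sequence[str],
    examples: Sequence[Tuple[Metadata, str]] = (),
) -> str:
    """Build a classification prompt.

    With no examples this is a zero-shot prompt; each (metadata, label)
    pair in `examples` adds one labeled demonstration (few-shot).
    """
    lines = [
        "Classify the microbial sample into exactly one of these categories:",
        ", ".join(labels),
        "",
    ]
    for example_metadata, example_label in examples:
        lines.append(f"Metadata: {format_sample(example_metadata)}")
        lines.append(f"Label: {example_label}")
        lines.append("")
    # The query comes last, with the label left open for the model to complete.
    lines.append(f"Metadata: {format_sample(query)}")
    lines.append("Label:")
    return "\n".join(lines)


# Hypothetical label subset and metadata fields, for illustration only.
LABELS = ["Water (saline)", "Water (non-saline)", "Soil (non-saline)"]
labeled = [({"env_material": "sea water", "depth_m": 5, "site": "A"}, "Water (saline)")]
query = {"env_material": "forest soil", "depth_m": 0.1, "site": "B"}

zero_shot_prompt = build_prompt(query, LABELS)
few_shot_prompt = build_prompt(query, LABELS, examples=labeled)
```

The same serialized fields could be one-hot or numerically encoded and passed to a Random Forest baseline. Under this sketch's assumptions, the only difference between the zero-shot and few-shot settings is whether labeled demonstrations precede the query.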
