Retrieval-Augmented Language Models Enable Scalable Chemical Source Classification in Metabolomics Workflows
Prajit Rajkumar, Runbang Tang, Harshada Sapre, Jasmine Zemlin, Victoria Deleray, Jeong In Seo, Siddharth Mohan, Shipei Xing, Harsha Gouda, Yasin El Abiead, Shirley M. Tsunoda, Haoqi Nina Zhao, Pieter C. Dorrestein

TL;DR
This paper introduces chemsource, a tool that uses large language models (LLMs) with retrieval-augmented generation (RAG) to classify chemicals into exposure categories automatically, supporting the interpretation of metabolomics data.
Contribution
chemsource is a novel framework using LLMs and RAG to automate chemical source classification with customizable prompts.
Findings
chemsource achieved 75% overall agreement with manual labels for 4,953 compounds.
Expert review found discrepancies due to prompt ambiguity and incomplete labels, not model failure.
chemsource revealed exposure patterns across human biospecimens, mouse tissues, and consumer products.
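The agreement figure above can be illustrated with a small calculation. This is a minimal sketch only: the summary does not say how agreement was scored, so the exact-match rule on label sets used here is an assumption, the function name `overall_agreement` is hypothetical, and the four toy compounds are made-up illustration data, not results from the paper.

```python
from typing import Dict, Iterable


def overall_agreement(
    manual: Dict[str, Iterable[str]],
    predicted: Dict[str, Iterable[str]],
) -> float:
    """Fraction of compounds whose predicted label set equals the manual one.

    Assumption: a compound counts as agreeing only on an exact set match.
    The paper may score agreement differently.
    """
    shared = manual.keys() & predicted.keys()
    if not shared:
        raise ValueError("no compounds in common between the two label sets")
    matches = sum(
        1 for name in shared if frozenset(manual[name]) == frozenset(predicted[name])
    )
    return matches / len(shared)


# Toy data, invented purely for illustration.
manual_labels = {
    "caffeine": ["food molecules", "drugs"],
    "glucose": ["endogenous metabolites", "food molecules"],
    "ibuprofen": ["drugs"],
    "bisphenol A": ["industrial chemicals"],
}
model_labels = {
    "caffeine": ["drugs", "food molecules"],  # same set, different order
    "glucose": ["food molecules"],            # missing one label: disagreement
    "ibuprofen": ["drugs"],
    "bisphenol A": ["industrial chemicals"],
}

agreement = overall_agreement(manual_labels, model_labels)
```

Under this strict rule a partially correct multi-label prediction counts as a disagreement; a per-category or overlap-based score would be more lenient.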
Abstract
There is a growing need for scalable chemical classification to support the interpretation of exposomics and metabolomics data. While structural categorization has been largely automated, functional and exposure-based labeling of chemicals remains a manual and time-consuming process. Here, we present chemsource, a flexible framework that integrates large language models (LLMs) with retrieval-augmented generation (RAG) to automate chemical classification. chemsource retrieves descriptive text from Wikipedia or PubMed abstracts based on chemical names and prompts LLMs to assign user-defined categories based on the retrieved content. We demonstrate classification into five exposure categories (endogenous metabolites, food molecules, drugs, personal care products, and industrial chemicals) and combinations of these. Benchmarking against manually curated labels for 4,953 compounds…
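The retrieve-then-prompt flow described in the abstract can be sketched as follows. This is not the chemsource API: the function names (`build_prompt`, `parse_labels`, `classify`), the prompt wording, and the comma-separated answer format are all assumptions made for illustration. Retrieval and the LLM call are passed in as plain callables, and the stand-ins below return canned text so the sketch runs offline; a real setup would query Wikipedia or PubMed and call an actual model.

```python
from typing import Callable, Dict, List

# Category names are taken from the abstract; everything else is hypothetical.
CATEGORIES: List[str] = [
    "endogenous metabolites",
    "food molecules",
    "drugs",
    "personal care products",
    "industrial chemicals",
]


def build_prompt(name: str, context: str, categories: List[str]) -> str:
    """Combine the retrieved text and the allowed categories into one prompt."""
    options = ", ".join(categories)
    return (
        f"Using only the text below, classify the chemical '{name}'.\n"
        f"Allowed categories: {options}.\n"
        "Answer with a comma-separated list of every category that applies.\n\n"
        f"Text:\n{context}"
    )


def parse_labels(response: str, categories: List[str]) -> List[str]:
    """Keep only recognised categories, de-duplicated, in response order."""
    allowed: Dict[str, str] = {c.lower(): c for c in categories}
    labels: List[str] = []
    for token in response.split(","):
        key = token.strip().lower()
        if key in allowed and allowed[key] not in labels:
            labels.append(allowed[key])
    return labels


def classify(
    name: str,
    retrieve: Callable[[str], str],
    llm: Callable[[str], str],
    categories: List[str] = CATEGORIES,
) -> List[str]:
    """Retrieve descriptive text for a chemical name, then ask the LLM to label it."""
    context = retrieve(name)
    if not context:
        return []  # nothing retrieved, so there is nothing to ground a label on
    return parse_labels(llm(build_prompt(name, context, categories)), categories)


# Offline stand-ins for the retrieval source and the model (illustration only).
def fake_retrieve(name: str) -> str:
    corpus = {
        "caffeine": "Caffeine is a stimulant found in coffee and tea "
        "and is used in some medications."
    }
    return corpus.get(name.lower(), "")


def fake_llm(prompt: str) -> str:
    return "Food molecules, drugs, unknown thing"


labels = classify("caffeine", fake_retrieve, fake_llm)
```

Filtering the model's answer against the allowed list keeps off-list text out of the results, and passing the category list as a parameter mirrors the user-defined categories the abstract mentions.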
Taxonomy
Topics: Metabolomics and Mass Spectrometry Studies · Health, Environment, Cognitive Aging · Computational Drug Discovery Methods
