# Retrieval-Augmented Language Models Enable Scalable Chemical Source Classification in Metabolomics Workflows

**Authors:** Prajit Rajkumar, Runbang Tang, Harshada Sapre, Jasmine Zemlin, Victoria Deleray, Jeong In Seo, Siddharth Mohan, Shipei Xing, Harsha Gouda, Yasin El Abiead, Shirley M. Tsunoda, Haoqi Nina Zhao, Pieter C. Dorrestein

PMC · DOI: 10.1021/acs.analchem.5c05301 · 2026-01-29

## TL;DR

This paper introduces chemsource, a Python tool that combines large language models with retrieval-augmented generation to classify chemicals into exposure categories automatically, supporting the interpretation of metabolomics data.

## Contribution

chemsource is a novel framework using LLMs and RAG to automate chemical source classification with customizable prompts.

## Key findings

- chemsource achieved 75% overall agreement with manual labels for 4,953 compounds.
- Expert review attributed most discrepancies to prompt ambiguity and incomplete manual labels rather than model failure.
- Applied to eight public untargeted metabolomics data sets, chemsource revealed distinct exposure patterns across human biospecimens, mouse tissues, environmental dust, and consumer product extracts.
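
The agreement figure above depends on how multi-label predictions are scored, which this summary does not spell out. The following is a minimal sketch, assuming that "overall agreement" means an exact match between predicted and manual label sets and that recall is computed per category; both definitions, the function names, and the toy labels are assumptions for illustration, not the paper's evaluation code.

```python
from typing import Dict, FrozenSet, Sequence

Labels = FrozenSet[str]


def overall_agreement(pred: Sequence[Labels], gold: Sequence[Labels]) -> float:
    """Fraction of compounds whose predicted label set equals the manual one."""
    if not gold or len(pred) != len(gold):
        raise ValueError("pred and gold must be non-empty and the same length")
    return sum(p == g for p, g in zip(pred, gold)) / len(gold)


def category_recall(pred: Sequence[Labels], gold: Sequence[Labels]) -> Dict[str, float]:
    """For each manual category, the fraction of its compounds that were recovered."""
    hits: Dict[str, int] = {}
    totals: Dict[str, int] = {}
    for p, g in zip(pred, gold):
        for cat in g:
            totals[cat] = totals.get(cat, 0) + 1
            if cat in p:
                hits[cat] = hits.get(cat, 0) + 1
    return {cat: hits.get(cat, 0) / n for cat, n in totals.items()}


# Toy labels invented for this sketch; they are not data from the paper.
gold = [frozenset({"FOOD"}), frozenset({"DRUG"}),
        frozenset({"FOOD", "DRUG"}), frozenset({"INDUSTRIAL"})]
pred = [frozenset({"FOOD"}), frozenset({"DRUG"}),
        frozenset({"FOOD"}), frozenset({"INDUSTRIAL"})]

agreement = overall_agreement(pred, gold)
recall = category_recall(pred, gold)
```

Under exact-match scoring, a prediction that recovers only one of two manual labels counts as a disagreement even though it still contributes to recall for the label it got right, which is one reason the two numbers can diverge.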

## Abstract

There is a growing need for scalable chemical classification to support the interpretation of exposomics and metabolomics data. While structural categorization has been largely automated, functional and exposure-based labeling of chemicals remains a manual and time-consuming process. Here, we present chemsource, a flexible framework that integrates large language models (LLMs) with retrieval-augmented generation (RAG) to automate chemical classification. chemsource retrieves descriptive text from Wikipedia or PubMed abstracts based on chemical names and prompts LLMs to assign user-defined categories based on the retrieved content. We demonstrate classification into five exposure categories: endogenous metabolites, food molecules, drugs, personal care products, industrial chemicals, and combinations of these possibilities. Benchmarking against manually curated labels for 4,953 compounds showed 75% overall agreement, with category-level recall exceeding 75% across all classes. Expert review indicated that most discrepancies could be attributed to prompt ambiguity and incomplete manual labels rather than model failure. To demonstrate the utility of chemsource in metabolomics workflow, we applied it to eight public untargeted metabolomics data sets, revealing distinct exposure patterns across human biospecimens, mouse tissues, environmental dust, and consumer product extracts. chemsource is customizable via prompt editing, enabling diverse classification tasks without requiring coding expertise. The tool is freely available as a Python package (https://pypi.org/project/chemsource/). Text retrieval is free; classification requires user-supplied LLM API access.
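
The retrieve-then-prompt flow described in the abstract can be sketched in a few lines. This is an illustration under stated assumptions, not the chemsource API: the category strings, the prompt wording, and the `build_prompt`, `parse_labels`, and `classify` helpers are all hypothetical, and the retrieval and LLM calls are passed in as plain functions so that no real service is assumed.

```python
from typing import Callable, FrozenSet

# Hypothetical label strings for the five exposure categories in the abstract.
CATEGORIES = ("ENDOGENOUS", "FOOD", "DRUG", "PERSONAL CARE", "INDUSTRIAL")


def build_prompt(name: str, context: str) -> str:
    """Combine the chemical name, allowed categories, and retrieved text."""
    options = ", ".join(CATEGORIES)
    return (
        f"Classify the chemical '{name}' using only the text below.\n"
        f"Answer with a comma-separated list drawn from: {options}.\n\n"
        f"Text:\n{context}"
    )


def parse_labels(reply: str) -> FrozenSet[str]:
    """Keep only tokens that match an allowed category; drop everything else."""
    tokens = {token.strip().upper() for token in reply.split(",")}
    return frozenset(token for token in tokens if token in CATEGORIES)


def classify(
    name: str,
    retrieve: Callable[[str], str],  # assumed: chemical name -> descriptive text
    ask_llm: Callable[[str], str],   # assumed: prompt -> raw model reply
) -> FrozenSet[str]:
    """Retrieve text for a chemical, prompt the model, and parse its labels."""
    context = retrieve(name)
    if not context:
        # No retrieved text means nothing to ground the answer on.
        return frozenset()
    return parse_labels(ask_llm(build_prompt(name, context)))
```

Because the categories live only in the prompt and the parser, changing the classification task in this sketch means editing `CATEGORIES` and the prompt text, which mirrors the abstract's point that the tool is customized through prompt editing.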

## Linked entities

- **Species:** Mus musculus (taxon 10090)

## Full-text entities

- **Species:** Homo sapiens (human, species) [taxon 9606], Mus musculus (house mouse, species) [taxon 10090]

## Figures

5 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12903069/full.md

---
Source: https://tomesphere.com/paper/PMC12903069