# Scaling sensor metadata extraction for exposure health using LLMs

**Authors:** Fatemeh Shah-Mohammadi, Sunho Im, Julio C Facelli, Mollie R Cummins, Ramkiran Gouripeddi

PMC · DOI: 10.1093/exposome/osag008 · Exposome · 2026-03-13

## TL;DR

This paper shows how large language models can automate the extraction of sensor metadata from exposure health literature, making the process faster and more consistent.

## Contribution

The novel contribution is an LLM-based pipeline for scalable and accurate sensor metadata extraction from unstructured health literature.

## Key findings

- The LLM pipeline achieved 88.0% accuracy and 90.0% F1-score in metadata extraction.
- The automated method completed extractions much faster than manual review.
- The approach enhances metadata completeness and consistency in exposure health research.

## Abstract

The rapid evolution and diversity of sensor technologies, coupled with inconsistencies in how sensor metadata is reported across formats and sources, present significant challenges for generating exposomes and conducting exposure health research.

Despite the development of standardized metadata schemas, the process of extracting sensor metadata from unstructured sources remains largely manual and unscalable. To address this bottleneck, we developed and evaluated a large language model (LLM)-based pipeline for automating sensor metadata extraction and harmonization from publicly available exposure health literature.

Using GPT-4 in a zero-shot setting, we constructed a pipeline that parses full-text PDFs to extract metadata and harmonizes output into structured formats.
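The two stages described above (zero-shot prompting, then harmonization into a structured format) might look roughly like the following minimal sketch. It is not the authors' implementation: the field names in `FIELDS`, the `SensorRecord` type, and both helper functions are hypothetical, the prompt wording is invented for illustration, and the model call itself is left out so that the block runs without network access.

```python
import json
from dataclasses import dataclass
from typing import Optional

# Hypothetical metadata fields; the paper's actual schema is not given in this summary.
FIELDS = ("sensor_name", "manufacturer", "measured_quantity", "unit")


@dataclass
class SensorRecord:
    sensor_name: Optional[str] = None
    manufacturer: Optional[str] = None
    measured_quantity: Optional[str] = None
    unit: Optional[str] = None


def build_zero_shot_prompt(paper_text: str) -> str:
    """Zero-shot: the prompt carries instructions and target keys, no worked examples."""
    field_list = ", ".join(FIELDS)
    return (
        "Extract sensor metadata from the text below. "
        f"Return a JSON object with the keys: {field_list}. "
        "Use null for anything the text does not state.\n\n"
        f"TEXT:\n{paper_text}"
    )


def harmonize(raw_response: str) -> SensorRecord:
    """Map a JSON model response onto the fixed record, ignoring unknown keys."""
    data = json.loads(raw_response)
    cleaned = {}
    for key in FIELDS:
        value = data.get(key)
        if value is not None:
            # Normalize to a trimmed string; empty strings become None.
            value = str(value).strip() or None
        cleaned[key] = value
    return SensorRecord(**cleaned)
```

In this sketch, the text extracted from a PDF would be passed to `build_zero_shot_prompt`, the prompt sent to the model, and the model's reply passed to `harmonize`. Keeping a fixed set of keys is one simple way to get consistent output across papers that report metadata differently.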

Our automated pipeline achieved substantial efficiency gains, completing extractions much faster than manual review, and demonstrated strong performance with 88.0% accuracy, 88.0% precision, 93.0% recall, and an F1-score of 90.0%.
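As a quick consistency check on these figures, F1 is the harmonic mean of precision and recall. The short sketch below recomputes it from the reported precision and recall; the function name `f1_score` is ours, and treating the published values as rounded to whole percentages is an assumption.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both as fractions in [0, 1])."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Reported precision (88.0%) and recall (93.0%) from the abstract.
recomputed_f1 = f1_score(0.88, 0.93)
print(f"{recomputed_f1:.3f}")  # 0.904
```

The recomputed value of about 90.4% is in line with the reported 90.0% F1-score if the published metrics are rounded to the nearest whole percent.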

This study demonstrates the feasibility and scalability of leveraging LLMs to automate sensor metadata extraction for exposure health, reducing manual burden while enhancing metadata completeness and consistency. Our findings support the integration of LLM-driven pipelines into exposure health informatics platforms.

## Full-text entities

- **Chemicals:** CO (MESH:D002248), SO2 (MESH:D013458), NO2 (MESH:D009585), water (MESH:D014867)
- **Species:** Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC13012662/full.md

## Figures

2 figures with captions in the complete paper: https://tomesphere.com/paper/PMC13012662/full.md

## References

35 references — full list in the complete paper: https://tomesphere.com/paper/PMC13012662/full.md

---
Source: https://tomesphere.com/paper/PMC13012662