# Assessing the Feasibility and Acceptability of a Bespoke Large Language Model Pipeline to Extract Data From Different Study Designs for Public Health Evidence Reviews

**Authors:** Zalaya Simmons, Beti Evans, Tamsyn Harris, Harry Woolnough, Lauren Dunn, Jonathon Fuller, Kerry Cella, Daphne Duval

PMC · DOI: 10.1002/cesm.70061 · Cochrane Evidence Synthesis and Methods · 2025-11-04

## TL;DR

This study evaluates a bespoke LLM pipeline for extracting data from public health studies of varying designs. Human reviewers rated 68% of the extracted data fields as acceptable, so human quality assurance remains necessary.

## Contribution

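The study introduces a bespoke Retrieval‐Augmented Generation pipeline (LLaMa 3‐70B) for data extraction across diverse study designs in public health evidence reviews, together with an evaluation framework that rates outputs by error type and acceptability.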

## Key findings

- 68% of the 173 data fields extracted by the LLM were rated as acceptable by human reviewers.
- Acceptability varied by data field (33% to 100%): 'objective', 'setting', and 'study design' reached at least 90%, whereas fields such as 'outcome' and 'time period' reached 54% or less.
- The mean maximum agreement rate for reliability was 0.71 (SD: 0.28) over 10 consecutive extractions, indicating moderate consistency in LLM outputs.

## Abstract

Data extraction is a critical but resource‐intensive step of the evidence review process. Whilst there is evidence that artificial intelligence (AI) and large language models (LLMs) can improve the efficiency of data extraction from randomized controlled trials, their potential for other study designs is unclear. In this context, this study aimed to evaluate the performance of a bespoke LLM pipeline (a Retrieval‐Augmented Generation pipeline utilizing LLaMa 3‐70B) in automating data extraction from a range of study designs, by assessing the accuracy and reliability of its extractions in terms of error types and acceptability.

Accuracy was assessed by retrospectively comparing the LLM extractions against human extractions from a review previously conducted by the authors. A total of 173 data fields from 24 articles (including experimental, observational, qualitative, and modeling studies) were assessed, of which three were used for prompt engineering. Reliability was assessed by calculating the mean maximum agreement rate (the highest proportion of identical returns from 10 consecutive extractions) for 116 data fields from 16 of the 24 studies. An evaluation framework was developed to assess the accuracy and reliability of LLM outputs measured as error types and acceptability (acceptability was assessed on whether it would be usable in real‐world settings if the model acted as one reviewer and a human as a second reviewer).
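
The reliability metric described above can be illustrated with a minimal sketch. It assumes that "identical returns" means exact string equality between repeated outputs; the paper's actual matching rule may differ. The function names and the toy data are hypothetical and are not taken from the study.

```python
from collections import Counter
from statistics import mean


def max_agreement_rate(outputs):
    """Highest proportion of identical returns among repeated extractions.

    Assumption: two returns are "identical" when their strings match exactly.
    """
    if not outputs:
        raise ValueError("need at least one extraction")
    # Count of the most frequent output, divided by the number of runs.
    top_count = Counter(outputs).most_common(1)[0][1]
    return top_count / len(outputs)


def mean_max_agreement(runs_by_field):
    """Average the maximum agreement rate over all data fields."""
    return mean(max_agreement_rate(runs) for runs in runs_by_field.values())


# Hypothetical toy data: 10 consecutive extractions per data field.
toy_runs = {
    "study design": ["cohort study"] * 10,                # all runs agree
    "setting": ["England"] * 7 + ["United Kingdom"] * 3,  # 7 of 10 agree
    "outcome": ["a", "b", "c", "d", "e"] * 2,             # at most 2 of 10 agree
}

for field, runs in toy_runs.items():
    print(field, max_agreement_rate(runs))
print("mean maximum agreement rate:", round(mean_max_agreement(toy_runs), 2))
```

Under this definition a field scores 1.0 only when all 10 runs return the same text, so semantically equivalent but differently worded outputs lower the rate.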

Of the 173 data fields evaluated for accuracy, 68% were rated by human reviewers as acceptable (consistent with what is deemed to be acceptable data extraction from a human reviewer). However, acceptability ratings varied depending on the data field extracted (33% to 100%), with at least 90% acceptability for “objective,” “setting,” and “study design,” but 54% or less for data fields such as “outcome” and “time period.” For reliability, the mean maximum agreement rate was 0.71 (SD: 0.28), with variation across different data fields.

This evaluation demonstrates the potential for LLMs, when paired with human quality assurance, to support data extraction in evidence reviews that include a range of study designs. However, further improvements in performance and validation are required before the model can be introduced into review workflows.

## Full-text entities

- **Species:** Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12584109/full.md

## Figures

3 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12584109/full.md

## References

49 references — full list in the complete paper: https://tomesphere.com/paper/PMC12584109/full.md

---
Source: https://tomesphere.com/paper/PMC12584109