# Enhancing Evidence Synthesis Efficiency: Leveraging Large Language Models and Agentic Workflows for Optimized Literature Screening

**Authors:** Bing Hu, Emmalie Tomini, Tricia Corrin, Kusala Pussegoda, Elias Sandner, Andre Henriques, Alice Simniceanu, Luca Fontana, Andreas Wagner, Stephanie Brazeau, Lisa Waddell

PMC · DOI: 10.1002/cesm.70042 · Cochrane Evidence Synthesis and Methods · 2025-10-21

## TL;DR

This paper introduces GREP-Agent, an AI system that improves the efficiency of screening scientific literature for public health evidence using multiple large language models and human feedback.

## Contribution

The novel contribution is the GREP-Agent framework, which combines multiple LLMs and human feedback to optimize literature screening accuracy and workload reduction.

## Key findings

- GREP-Agent increased sensitivity across LLMs to 84.2%–88.9% after fine-tuning.
- Varying the workload reduction strategy brought sensitivity to 86.4%–95.3%.
- Performance depended significantly on the clarity of the screening questions and on the thresholds set for the workload reduction strategies.

## Abstract

Public health events of international concern highlight the need for up‐to‐date evidence curated using sustainable, accessible processes. In developing the Global Repository of Epidemiological Parameters (grEPI), we explore the performance of an agentic‐AI‐assisted pipeline (GREP‐Agent) for screening evidence, which capitalizes on recent advancements in large language models (LLMs).

In this study, the performance of the GREP‐Agent was evaluated on a data set of 2000 citations from a systematic review on measles using four LLMs (GPT4o, GPT4o‐mini, Llama3.1, and Phi4). The GREP‐Agent framework integrates multiple LLMs and human feedback to fine‐tune its performance and to optimize both workload reduction and accuracy in screening research articles. The contribution of each part of this agentic‐AI system to performance is reported using accuracy, precision, recall, and F1‐score.
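For readers less familiar with the four metrics named above, the following is a minimal sketch of how they are conventionally computed when model include/exclude decisions are compared against human reference labels. The function name `screening_metrics`, the `ScreeningMetrics` container, and the toy labels are illustrative assumptions, not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class ScreeningMetrics:
    accuracy: float
    precision: float
    recall: float  # also called sensitivity
    f1: float


def screening_metrics(human_labels, model_labels):
    """Compare model decisions with human reference labels (True = include)."""
    if len(human_labels) != len(model_labels) or not human_labels:
        raise ValueError("label lists must be non-empty and of equal length")

    pairs = list(zip(human_labels, model_labels))
    tp = sum(1 for h, m in pairs if h and m)          # relevant, included
    fp = sum(1 for h, m in pairs if not h and m)      # irrelevant, included
    fn = sum(1 for h, m in pairs if h and not m)      # relevant, missed
    tn = sum(1 for h, m in pairs if not h and not m)  # irrelevant, excluded

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return ScreeningMetrics(accuracy, precision, recall, f1)


# Hypothetical toy data: 8 citations, 3 of them truly relevant.
human = [True, True, True, False, False, False, False, False]
model = [True, True, False, True, False, False, False, False]
print(screening_metrics(human, model))
```

Recall is the same quantity that screening studies usually call sensitivity: the share of truly relevant citations that the screener keeps.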

The results show how each phase of the GREP‐Agent system incrementally improves accuracy regardless of the LLM. We found that GREP‐Agent was able to increase sensitivity across a broad range of open source and proprietary LLMs to 84.2%–88.9% after fine‐tuning and to 86.4%–95.3% by varying workload reduction strategies. Performance was significantly impacted by the clarity of the screening questions and setting thresholds for optimized workload reduction strategies.
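The abstract does not describe how the thresholds are applied, so the following is only a generic sketch of the trade-off that a threshold-based workload reduction strategy implies. It assumes each citation has a model relevance score in [0, 1] and that citations scoring below the threshold are excluded without human review. The `triage` function, the scores, and the rule itself are hypothetical illustrations, not the GREP‐Agent implementation.

```python
def triage(scores, human_labels, threshold):
    """Return (sensitivity, workload_reduction) for a score threshold.

    Assumed rule: citations with score >= threshold go to human review;
    the rest are auto-excluded. human_labels holds the reference
    decisions (True = relevant).
    """
    if not scores or len(scores) != len(human_labels):
        raise ValueError("scores and labels must be non-empty and of equal length")

    sent_to_human = [s >= threshold for s in scores]
    relevant_total = sum(1 for rel in human_labels if rel)
    relevant_kept = sum(
        1 for keep, rel in zip(sent_to_human, human_labels) if keep and rel
    )

    sensitivity = relevant_kept / relevant_total if relevant_total else 0.0
    workload_reduction = 1 - sum(sent_to_human) / len(scores)
    return sensitivity, workload_reduction


# Hypothetical scores for 8 citations; the first three are truly relevant.
scores = [0.95, 0.80, 0.40, 0.30, 0.20, 0.10, 0.05, 0.60]
labels = [True, True, True, False, False, False, False, False]

for t in (0.25, 0.50):
    sens, saved = triage(scores, labels, t)
    print(f"threshold={t:.2f} sensitivity={sens:.3f} workload_reduction={saved:.3f}")
```

On this toy data, raising the threshold from 0.25 to 0.50 saves more human screening effort but drops one relevant citation, which is the kind of tension a threshold setting has to balance.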

The GREP‐Agent shows promise in improving the efficiency and effectiveness of evidence synthesis in dynamic public health contexts. Further development and refinement of adaptable human‐in‐the‐loop AI systems for screening literature are essential to support future public health response activities, while maintaining a human‐centric approach.

## Linked entities

- **Diseases:** measles (MONDO:0004619)

## Full-text entities

- **Diseases:** measles (MESH:D008457)
- **Species:** Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12538819/full.md

## Figures

9 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12538819/full.md

## References

26 references — full list in the complete paper: https://tomesphere.com/paper/PMC12538819/full.md

---
Source: https://tomesphere.com/paper/PMC12538819