# Accelerating the pace and accuracy of systematic reviews using AI: a validation study

**Authors:** Jiada Zhan, Kara Suvada, Muwu Xu, Wenya Tian, Kelly C. Cara, Taylor C. Wallace, Mohammed K. Ali

PMC · DOI: 10.1186/s13643-025-02997-8 · 2025-12-18

## TL;DR

This validation study compares Review Copilot, a GPT-4-based screening program, with human decisions in screening articles for systematic reviews, measuring both its accuracy and the time it saves.

## Contribution

The study evaluates the performance of Review Copilot (GPT-4) in systematic review screening tasks against human decisions.

## Key findings

- Review Copilot showed high sensitivity (99.2%) but moderate specificity (83.6%) in title/abstract screening.
- Full-text screening by Review Copilot had high sensitivity (97.6%) but lower specificity (47.4%).
- AI screening took one-quarter of the time required for human screening.

## Abstract

Artificial intelligence (AI) can greatly enhance efficiency in systematic literature reviews and meta-analyses, but its accuracy in screening titles/abstracts and full-text articles is uncertain.

This study evaluated the performance metrics (sensitivity, specificity) of a GPT-4 AI program, Review Copilot, against human decisions (gold standard) in screening titles/abstracts and full-text articles from four published systematic reviews/meta-analyses.

Participant data from four previously published systematic literature reviews were used for this validation study, which compared Review Copilot with human decision-making (gold standard) in screening titles/abstracts and full-text articles for systematic reviews/meta-analyses. The four reviews included observational studies and randomized controlled trials. Review Copilot operates on the OpenAI GPT-4 server. We examined how well Review Copilot's decisions to include or exclude titles/abstracts and full-text articles matched human decisions in the four systematic reviews/meta-analyses. Sensitivity, specificity, and balanced accuracy were calculated for both title/abstract and full-text screening, with human decisions as the reference.
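The three metrics named above can be computed from paired include/exclude decisions. The following is a minimal sketch using the standard definitions, not the authors' code; the function name `screening_metrics` and the example decisions are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class ScreeningMetrics:
    sensitivity: float
    specificity: float
    balanced_accuracy: float


def screening_metrics(human, ai):
    """Score AI include/exclude decisions against human decisions.

    Both arguments are equal-length sequences of booleans, where True
    means "include". Human decisions are treated as the gold standard.
    """
    if len(human) != len(ai):
        raise ValueError("human and ai must have the same length")

    pairs = list(zip(human, ai))
    tp = sum(1 for h, a in pairs if h and a)          # both include
    fn = sum(1 for h, a in pairs if h and not a)      # AI missed an include
    tn = sum(1 for h, a in pairs if not h and not a)  # both exclude
    fp = sum(1 for h, a in pairs if not h and a)      # AI over-included

    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one human include and one human exclude")

    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return ScreeningMetrics(
        sensitivity=sensitivity,
        specificity=specificity,
        balanced_accuracy=(sensitivity + specificity) / 2,
    )


# Hypothetical decisions for ten records (not data from the study).
human_decisions = [True, True, True, True, False, False, False, False, False, False]
ai_decisions = [True, True, True, False, False, False, False, False, True, True]

metrics = screening_metrics(human_decisions, ai_decisions)
print(metrics)
```

Balanced accuracy is the mean of sensitivity and specificity, so it is not inflated when excluded records far outnumber included ones. Applying that formula to the rounded figures reported for this study gives (99.2 + 83.6) / 2 = 91.4% for title/abstract screening and (97.6 + 47.4) / 2 = 72.5% for full-text screening; these values are derived here and may differ slightly from any balanced accuracy stated in the full paper.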

Review Copilot’s sensitivity and specificity were 99.2% and 83.6%, respectively, for title/abstract screening, and 97.6% and 47.4% for full-text screening. The average agreement between two runs was 95.4%, with a kappa statistic of 0.83. Review Copilot completed screening in one-quarter of the time that humans required.
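Run-to-run consistency of this kind is typically summarized by raw percent agreement together with Cohen's kappa, which corrects for agreement expected by chance. The sketch below shows the standard two-rater, two-category calculation; it assumes the reported kappa is Cohen's kappa, which the summary does not state, and the function name and example runs are hypothetical.

```python
def agreement_and_kappa(run_a, run_b):
    """Return (observed agreement, Cohen's kappa) for two binary runs.

    Each run is a sequence of booleans (True = include) over the same
    records, in the same order.
    """
    if len(run_a) != len(run_b) or not run_a:
        raise ValueError("runs must be non-empty and of equal length")

    n = len(run_a)
    observed = sum(1 for a, b in zip(run_a, run_b) if a == b) / n

    # Chance agreement from each run's marginal include rate.
    p_a = sum(run_a) / n
    p_b = sum(run_b) / n
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)

    if expected == 1:
        raise ValueError("kappa is undefined when chance agreement is 1")

    kappa = (observed - expected) / (1 - expected)
    return observed, kappa


# Hypothetical decisions from two runs over ten records (not study data).
first_run = [True, True, True, False, False, False, False, False, False, False]
second_run = [True, True, False, False, False, False, False, False, False, True]

observed_agreement, kappa = agreement_and_kappa(first_run, second_run)
print(f"agreement={observed_agreement:.3f}, kappa={kappa:.3f}")
```

Kappa is lower than raw agreement whenever one category dominates, because two runs that both exclude most records will agree often by chance alone. That is consistent with the gap between the 95.4% agreement and the 0.83 kappa reported above.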

AI use in systematic reviews and meta-analyses is inevitable. Health researchers must understand these technologies’ strengths and limitations to ethically leverage them for research efficiency and evidence-based decision-making in health.

The online version contains supplementary material available at 10.1186/s13643-025-02997-8.

## Full-text entities

- **Species:** Homo sapiens (human, species) [taxon 9606]

## Figures

1 figure with captions in the complete paper: https://tomesphere.com/paper/PMC12829171/full.md

---
Source: https://tomesphere.com/paper/PMC12829171