# Study protocol for evaluating automation of systematic review processes with EPPI-Reviewer and Copilot 365 in updating the cataract evidence gap map

**Authors:** Bhavisha Virendrakumar, Hugh Sharma Waddington, Pauline Scheelbeek, Emma Jolley, Elena Schmidt

PMC · DOI: 10.1186/s13643-026-03101-4 · 2026-02-17

## TL;DR

This protocol sets out how the accuracy and efficiency of AI tools, specifically prioritisation screening in EPPI-Reviewer and Copilot 365, will be evaluated against human reviewers when updating a cataract evidence gap map.

## Contribution

The study introduces a protocol to assess AI tools in systematic review processes, comparing their accuracy and efficiency to human performance.

## Key findings

- AI tools will be tested for accuracy and efficiency in screening, data extraction, and critical appraisal stages.
- Results will guide the use of AI in evidence synthesis and highlight its limitations.
- Copilot 365's performance will be compared to human reviewers using statistical measures like Cohen’s Kappa.
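As a rough illustration of the agreement statistic named above, the following minimal sketch computes Cohen's Kappa for two sets of include/exclude decisions. The function name `cohens_kappa` and the example decisions are hypothetical; they are not taken from the protocol, which does not specify an implementation.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's Kappa for two equal-length sequences of categorical decisions."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("Both raters need the same non-zero number of decisions")
    n = len(rater_a)
    # Observed agreement: share of items on which both raters gave the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal proportions, summed over labels.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        # Both raters used a single identical label throughout; treat as perfect agreement.
        return 1.0
    return (observed - expected) / (1 - expected)


# Made-up decisions for ten full-text records (illustrative only).
human = ["include", "include", "exclude", "exclude", "include",
         "exclude", "exclude", "exclude", "include", "exclude"]
copilot = ["include", "include", "exclude", "exclude", "exclude",
           "exclude", "exclude", "include", "include", "exclude"]

kappa = cohens_kappa(human, copilot)
print(f"Cohen's Kappa: {kappa:.3f}")
```

In this made-up example the raters agree on 8 of 10 records, yet Kappa is lower than 0.8 because part of that agreement would be expected by chance alone.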

## Abstract

The process of developing and updating an evidence gap map (EGM) is based on the principles of systematic reviews and requires extensive time and financial resources. Artificial intelligence (AI) tools, like prioritisation screening (PS), integrated into programmes such as EPPI-Reviewer (ER) and Copilot 365, can potentially mimic human performance in systematic review processes. ER is a subscription-based web application employed by systematic review groups, while Copilot 365, integrated into Microsoft 365, offers real-time assistance. Although ER shows promise in speeding up screening, the optimal threshold for accuracy remains unclear. Additionally, there is no evidence on the effectiveness of any version of Copilot in systematic review and EGM processes.

The objective is to assess the accuracy and efficiency of Copilot 365 and PS integrated into ER at different stages of an EGM update, comparing them to human performance.

We will conduct both manual and automated screening of references, full-text screening, data extraction, and critical appraisal. Two reviewers will independently screen studies for inclusion, extract data, and appraise included studies, resolving conflicts through discussion. We will assess the accuracy and efficiency of Copilot 365 and ER at different EGM update stages, comparing them to human performance. To evaluate the PS accuracy, we will use 20% and 40% manual screening thresholds, calculating the proportion of relevant references prioritised by PS and the total relevant citations missed. We will compare Copilot 365’s full-text screening accuracy to reviewers’ decisions and assess consistency using Cohen’s Kappa. For automated data extraction and appraisal, we will manually inspect 20% of Copilot 365’s outputs, comparing them to reviewers’ results, measuring consistency with Cohen’s Kappa, and evaluating time savings by comparing the time taken for manual extraction versus using Copilot 365.
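The prioritisation screening calculation described above could be sketched as follows. This is a simplified illustration under stated assumptions: it treats the PS output as a fixed ranked list (in practice the ranking may be updated as screening proceeds), and it assumes that the 20% and 40% thresholds mean manually screening the top-ranked 20% or 40% of references. The function name `ps_performance` and the example data are hypothetical.

```python
def ps_performance(ranked_relevance, percent_screened):
    """Summarise how many relevant references fall within the screened portion.

    ranked_relevance: list of 0/1 human relevance labels, ordered by the
        PS ranking (most likely relevant first). Assumed static here.
    percent_screened: integer percentage of the list screened manually.
    """
    if not 0 < percent_screened <= 100:
        raise ValueError("percent_screened must be in the range 1..100")
    n = len(ranked_relevance)
    # Integer ceiling division avoids floating-point rounding surprises.
    n_screened = (n * percent_screened + 99) // 100
    total_relevant = sum(ranked_relevance)
    found = sum(ranked_relevance[:n_screened])
    missed = total_relevant - found
    # Proportion of all relevant references captured within the threshold.
    proportion = found / total_relevant if total_relevant else float("nan")
    return {
        "screened": n_screened,
        "found": found,
        "missed": missed,
        "proportion_found": proportion,
    }


# Made-up ranking of 20 references, 5 of which reviewers judged relevant.
ranked = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0]

for threshold in (20, 40):
    result = ps_performance(ranked, threshold)
    print(threshold, result)
```

With these illustrative labels, screening the top 20% captures 3 of 5 relevant references (2 missed), and screening the top 40% captures 4 of 5 (1 missed).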

This study will offer insights into ER’s accuracy in screening small samples of citations and potentially guide future applications in this context. Additionally, by evaluating Copilot 365, which shares similar features with other AI tools, we will gain a broader understanding of its applicability and limitations in evidence synthesis, making the results relevant to other AI applications in this field.

Registered at Open Science Framework: https://doi.org/10.17605/OSF.IO/49BX8.

The online version contains supplementary material available at 10.1186/s13643-026-03101-4.

## Linked entities

- **Diseases:** cataract (MONDO:0005129)

## Full-text entities

- **Chemicals:** Copilot 365 (-)
- **Species:** Homo sapiens (human, species) [taxon 9606]

## Figures

3 figures with captions in the complete paper: https://tomesphere.com/paper/PMC13014901/full.md

---
Source: https://tomesphere.com/paper/PMC13014901