Human vs. AI in Conducting Scoping Reviews: Evaluating Large Language Model Accuracy Across Article and Task Type

Chang Yu; Mo Han; Rui Huang; Hanna Grol-Prokopczyk; Gongda Yu

PMC · DOI: 10.1093/geroni/igaf122.2181 · December 31, 2025


Open Access

TL;DR

This paper compares human and AI coding of data for scoping reviews, finding that human-AI agreement is higher on simple (single-select) tasks than on multiple-select tasks, and higher for meta-analyses than for narrative reviews.

Contribution

The study provides empirical evidence on ChatGPT-4o's reliability in content coding for scoping reviews across different article and task types.

Findings

01

Human-AI agreement was higher for single-select questions (71%) than for multiple-select questions (29%).

02

Agreement was highest for meta-analyses (85% for single-select) and lowest for narrative reviews (17% for multiple-select).

03

ChatGPT-4o's reliability declines on more complex tasks and varies across article types.
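
The single-select versus multiple-select comparison above can be illustrated with a simple percent-agreement calculation. The following is a minimal Python sketch, assuming exact-match scoring (for a multiple-select item, the AI's selected set must equal the human coder's set). This summary does not state how the study actually scored agreement, and the function names and sample labels below are hypothetical.

```python
from typing import Hashable, Iterable, Sequence


def single_select_agreement(human: Sequence[Hashable], ai: Sequence[Hashable]) -> float:
    """Fraction of items where the AI label equals the human label."""
    if len(human) != len(ai) or not human:
        raise ValueError("inputs must be non-empty and of equal length")
    matches = sum(h == a for h, a in zip(human, ai))
    return matches / len(human)


def multi_select_agreement(
    human: Sequence[Iterable[Hashable]], ai: Sequence[Iterable[Hashable]]
) -> float:
    """Fraction of items where the AI's selected set exactly equals the human's.

    Assumption: exact set match; partial overlap counts as disagreement.
    """
    if len(human) != len(ai) or not human:
        raise ValueError("inputs must be non-empty and of equal length")
    matches = sum(set(h) == set(a) for h, a in zip(human, ai))
    return matches / len(human)


# Hypothetical codes for illustration only (not data from the study).
human_single = ["meta-analysis", "narrative", "systematic", "narrative"]
ai_single = ["meta-analysis", "narrative", "narrative", "narrative"]

human_multi = [{"age", "sex"}, {"race"}, {"age"}]
ai_multi = [{"sex", "age"}, {"race", "age"}, {"age"}]

print(f"{single_select_agreement(human_single, ai_single):.0%}")  # 75%
print(f"{multi_select_agreement(human_multi, ai_multi):.0%}")     # 67%
```

Under exact-match scoring, a multiple-select item counts as a disagreement if even one option differs, so lower agreement on such items would be expected to some degree regardless of the model.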

Abstract

Large language models (LLMs), such as ChatGPT, are increasingly used to assist with health- and aging-related scoping reviews, which are often very time-consuming when done by humans alone. However, empirical evidence on the reliability of LLMs in extracting data from peer-reviewed literature (content coding) remains limited. This study evaluates the accuracy of ChatGPT-4o’s content coding across different article types (systematic reviews, narrative reviews, and meta-analyses) and task types (e.g., single-select vs. multiple-select questions) by comparing its results to human coding from an existing scoping review. We selected 26 articles from a previously human-coded scoping review of 398 articles on social disparities (including age-related disparities) in pain. We then used ChatGPT-4o’s Application Programming Interface (API) to extract and code eight characteristics of each…

Taxonomy

Topics: Artificial Intelligence in Healthcare and Education · Mental Health via Writing · Digital Mental Health Interventions