Large Language Models with Human-In-The-Loop Validation for Systematic Review Data Extraction
Noah L. Schroeder, Chris Davis Jaldi, Shan Zhang

TL;DR
This study evaluates the use of large language models for automating data extraction in systematic reviews, demonstrating promising accuracy but emphasizing the importance of human oversight, and introduces an open-source tool for this purpose.
Contribution
It introduces a human-in-the-loop framework and an open-source tool (AIDE) to improve LLM-based data extraction for systematic reviews.
Findings
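Between 62.43% and 72.14% of the data extracted by the LLMs was consistent with human coding
Performance varied by model: Gemini 1.5 Pro (72.14%) and Gemini 1.5 Flash (71.17%) outperformed Mistral Large 2 (62.43%)
A human-in-the-loop process is essential for reliable data extraction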
Abstract
Systematic reviews are time-consuming endeavors. Historically, knowledgeable humans have had to screen studies and extract data from them before the data can be analyzed. However, large language models (LLMs) hold promise to greatly accelerate this process. After a pilot study that showed great promise, we investigated the use of freely available LLMs for extracting data for systematic reviews. Using three different LLMs, we extracted 24 types of data (9 explicitly stated variables and 15 derived categorical variables) from 112 studies that were included in a published scoping review. Overall, we found that Gemini 1.5 Flash, Gemini 1.5 Pro, and Mistral Large 2 performed reasonably well, with 71.17%, 72.14%, and 62.43% of data extracted being consistent with human coding, respectively. While promising, these results highlight the dire need for a human-in-the-loop (HIL) process for…
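The consistency figures above can be read as a percent-agreement measure between LLM-extracted values and human-coded values. The sketch below is only an illustration under assumptions, not the authors' scoring procedure or the AIDE implementation: the field names, the example values, and the normalization rule (case-insensitive comparison with surrounding whitespace removed) are all hypothetical.

```python
from typing import Dict, List


def normalize(value: object) -> str:
    """Assumed matching rule: compare values case-insensitively, ignoring outer whitespace."""
    return str(value).strip().lower()


def disagreements(human: Dict[str, object], llm: Dict[str, object]) -> List[str]:
    """Return the human-coded fields whose LLM value is missing or does not match."""
    return [
        field
        for field, value in human.items()
        if field not in llm or normalize(llm[field]) != normalize(value)
    ]


def percent_agreement(human: Dict[str, object], llm: Dict[str, object]) -> float:
    """Percentage of human-coded fields for which the LLM value is consistent."""
    if not human:
        raise ValueError("no human-coded fields to compare against")
    matched = len(human) - len(disagreements(human, llm))
    return 100.0 * matched / len(human)


# Hypothetical coding of one study (not data from the paper).
human_coding = {"sample_size": "112", "country": "USA", "design": "experimental", "year": "2021"}
llm_coding = {"sample_size": "112", "country": "usa ", "design": "quasi-experimental", "year": "2021"}

agreement = percent_agreement(human_coding, llm_coding)
to_review = disagreements(human_coding, llm_coding)
print(f"{agreement:.2f}% consistent; fields flagged for human review: {to_review}")
```

In a human-in-the-loop workflow, the flagged fields are the ones a reviewer would check against the source study before the data is used.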
Taxonomy
Topics: Topic Modeling
