What Level of Automation is "Good Enough"? A Benchmark of Large Language Models for Meta-Analysis Data Extraction
Lingbo Li, Anuradha Mathrani, Teo Susnjak

TL;DR
This study benchmarks large language models on extracting data from full-text RCTs for meta-analysis. It finds high precision but limited recall, and proposes guidelines for automation that balance efficiency with oversight.
Contribution
It evaluates multiple LLMs and prompting strategies for meta-analysis data extraction, providing practical guidelines for task-specific automation in medical research.
Findings
Models have high precision but low recall.
Customized prompts improve recall by up to 15%.
Guidelines match data types to automation levels.
Abstract
Automating data extraction from full-text randomised controlled trials (RCTs) for meta-analysis remains a significant challenge. This study evaluates the practical performance of three LLMs (Gemini-2.0-flash, Grok-3, GPT-4o-mini) across tasks involving statistical results, risk-of-bias assessments, and study-level characteristics in three medical domains: hypertension, diabetes, and orthopaedics. We tested four distinct prompting strategies (basic prompting, self-reflective prompting, model ensemble, and customised prompts) to determine how to improve extraction quality. All models demonstrate high precision but consistently suffer from poor recall by omitting key information. We found that customised prompts were the most effective, boosting recall by up to 15%. Based on this analysis, we propose a three-tiered set of guidelines for using LLMs in data extraction, matching data types…
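To make the precision/recall distinction concrete, here is a minimal sketch, assuming field-level exact-match scoring against a human-curated reference. The field names, values, and the `precision_recall` helper are hypothetical illustrations and are not taken from the paper's evaluation protocol.

```python
def precision_recall(extracted: dict, gold: dict) -> tuple[float, float]:
    """Score extracted fields against a gold reference (assumed exact-match rule)."""
    # A field is correct only if it was extracted and its value equals the gold value.
    correct = sum(1 for key, value in extracted.items()
                  if key in gold and gold[key] == value)
    # Precision: share of extracted fields that are correct.
    precision = correct / len(extracted) if extracted else 0.0
    # Recall: share of gold fields that were correctly extracted.
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical gold-standard record for one trial.
gold = {
    "sample_size": 120,
    "mean_age": 54.2,
    "intervention": "drug A",
    "follow_up_weeks": 12,
}

# Hypothetical model output: everything it reports is right, but half is missing.
extracted = {
    "sample_size": 120,
    "intervention": "drug A",
}

precision, recall = precision_recall(extracted, gold)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this toy case every reported field is correct (precision 1.0), yet only two of the four reference fields are recovered (recall 0.5), which is the omission pattern the abstract describes.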
