High-performance automated abstract screening with large language model ensembles
Rohan Sanghera, Arun James Thirunavukarasu, Marc El Khoury, Jessica O'Logbon, Yuqing Chen, Archie Watt, Mustafa Mahmood, Hamid Butt, George Nishimura, Andrew Soltan

TL;DR
This study evaluates large language models for automated abstract screening in systematic reviews. On an 800-record subset, the best LLM configurations exceeded human reviewers in sensitivity, precision, and balanced accuracy, which could reduce the labour cost of screening.
Contribution
The paper introduces a comprehensive evaluation of multiple LLMs for zero-shot abstract screening, highlighting their superior performance and consistency across large datasets.
Findings
LLMs achieved perfect sensitivity in many cases.
LLMs showed higher sensitivity than human reviewers.
Ensembles of LLMs maintained high sensitivity with moderate precision.
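The summary does not say how the ensembles combine individual model decisions. As a minimal sketch, assuming each model returns a binary include/exclude decision per abstract, one common way to favour sensitivity is to include a record when at least a chosen number of models vote to include it. The function names and the `min_include_votes` parameter below are hypothetical and are not taken from the paper.

```python
from typing import Dict, List


def ensemble_decision(votes: List[bool], min_include_votes: int = 1) -> bool:
    """Combine per-model include votes for one abstract.

    Assumption (not from the paper): a record is included when at least
    `min_include_votes` models vote to include it. With the default of 1,
    this is a union rule that favours sensitivity over precision.
    """
    return sum(votes) >= min_include_votes


def screen_with_ensemble(
    decisions_by_model: Dict[str, List[bool]], min_include_votes: int = 1
) -> List[bool]:
    """Apply the voting rule record by record across all models."""
    lengths = {len(d) for d in decisions_by_model.values()}
    if len(lengths) > 1:
        raise ValueError("every model must return one decision per record")
    # zip(*...) groups the i-th decision from every model together.
    return [
        ensemble_decision(list(votes), min_include_votes)
        for votes in zip(*decisions_by_model.values())
    ]


# Hypothetical decisions from three models over four abstracts.
example = {
    "model_a": [True, False, False, True],
    "model_b": [False, False, True, True],
    "model_c": [False, False, False, True],
}
union_result = screen_with_ensemble(example, min_include_votes=1)
majority_result = screen_with_ensemble(example, min_include_votes=2)
```

Raising `min_include_votes` trades sensitivity for precision: the union rule keeps any record that a single model flags, while a majority rule drops records that only one model flags.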
Abstract
Large language models (LLMs) excel in tasks requiring processing and interpretation of input text. Abstract screening is a labour-intensive component of systematic review involving repetitive application of inclusion and exclusion criteria on a large volume of studies identified by a literature search. Here, LLMs (GPT-3.5 Turbo, GPT-4 Turbo, GPT-4o, Llama 3 70B, Gemini 1.5 Pro, and Claude Sonnet 3.5) were trialled on systematic reviews in a full issue of the Cochrane Library to evaluate their accuracy in zero-shot binary classification for abstract screening. Trials over a subset of 800 records identified optimal prompting strategies and demonstrated superior performance of LLMs to human researchers in terms of sensitivity (LLM-max = 1.000, human-max = 0.775), precision (LLM-max = 0.927, human-max = 0.911), and balanced accuracy (LLM-max = 0.904, human-max = 0.865). The best performing…
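The three metrics quoted in the abstract can be computed from a confusion matrix. The sketch below uses the standard definitions (sensitivity = TP / (TP + FN), precision = TP / (TP + FP), balanced accuracy = mean of sensitivity and specificity). The function name and the example labels are illustrative assumptions, not the paper's code or data.

```python
import math
from typing import Dict, Sequence


def screening_metrics(truth: Sequence[bool], predicted: Sequence[bool]) -> Dict[str, float]:
    """Compute screening metrics where True means 'include the abstract'."""
    if len(truth) != len(predicted):
        raise ValueError("truth and predicted must have the same length")

    pairs = list(zip(truth, predicted))
    tp = sum(1 for t, p in pairs if t and p)          # correctly included
    fn = sum(1 for t, p in pairs if t and not p)      # wrongly excluded
    fp = sum(1 for t, p in pairs if not t and p)      # wrongly included
    tn = sum(1 for t, p in pairs if not t and not p)  # correctly excluded

    def ratio(numerator: int, denominator: int) -> float:
        # Return NaN rather than raising when a class is absent.
        return numerator / denominator if denominator else math.nan

    sensitivity = ratio(tp, tp + fn)
    specificity = ratio(tn, tn + fp)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": ratio(tp, tp + fp),
        "balanced_accuracy": (sensitivity + specificity) / 2,
    }


# Hypothetical reference labels and screening decisions for five abstracts.
truth = [True, True, False, False, False]
predicted = [True, False, True, False, False]
metrics = screening_metrics(truth, predicted)
```

Balanced accuracy is useful here because included studies are usually a small minority of the records a search returns, so plain accuracy would reward a screener that excludes everything.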
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Text and Document Classification Technologies
Methods: Linear Layer · Layer Normalization · Position-Wise Feed-Forward Layer · Adam · Attention Is All You Need · Multi-Head Attention · Residual Connection · Byte Pair Encoding · Dropout · Absolute Position Encodings
