High-performance automated abstract screening with large language model ensembles

Rohan Sanghera; Arun James Thirunavukarasu; Marc El Khoury; Jessica O'Logbon; Yuqing Chen; Archie Watt; Mustafa Mahmood; Hamid Butt; George Nishimura; Andrew Soltan

arXiv:2411.02451·cs.CL·November 25, 2024·2 cites

Open Access

TL;DR

This study evaluates large language models for automated abstract screening in systematic reviews. It shows that they can exceed human researchers in sensitivity and precision, which could reduce the labour cost of screening.

Contribution

The paper evaluates six LLMs on zero-shot abstract screening across the systematic reviews in a full issue of the Cochrane Library, and reports strong, consistent performance at that scale.

Findings

01. LLMs achieved perfect sensitivity in many cases.

02. LLMs showed higher sensitivity than human reviewers.

03. Ensembles of LLMs maintained high sensitivity with moderate precision.
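One way to see why an ensemble can hold sensitivity high at some cost in precision is a simple vote over per-model include/exclude decisions. The sketch below is illustrative only: this summary does not say how the paper combines models, so the function name `ensemble_vote`, the `min_include_votes` threshold, and the toy decisions are all assumptions rather than the authors' method.

```python
from typing import List, Sequence


def ensemble_vote(
    decisions: Sequence[Sequence[bool]],
    min_include_votes: int = 1,
) -> List[bool]:
    """Combine per-model screening decisions into one decision per record.

    decisions[m][i] is True if hypothetical model m voted to include record i.
    A record is included when at least `min_include_votes` models include it.
    With the default of 1 (an "any model includes" rule), a relevant record is
    missed only if every model misses it, which favours sensitivity; the
    trade-off is that any single false positive is also carried through.
    """
    if not decisions:
        return []
    n_records = len(decisions[0])
    if any(len(d) != n_records for d in decisions):
        raise ValueError("every model must score the same records")
    # zip(*decisions) groups the votes for each record across models.
    return [sum(votes) >= min_include_votes for votes in zip(*decisions)]


# Toy decisions from three hypothetical models over four records.
model_a = [True, False, False, True]
model_b = [False, True, False, True]
model_c = [False, False, False, True]

any_rule = ensemble_vote([model_a, model_b, model_c])
majority_rule = ensemble_vote([model_a, model_b, model_c], min_include_votes=2)
```

Raising `min_include_votes` moves the ensemble toward a majority rule, which would typically trade some sensitivity for precision.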

Abstract

Large language models (LLMs) excel in tasks requiring processing and interpretation of input text. Abstract screening is a labour-intensive component of systematic review involving repetitive application of inclusion and exclusion criteria on a large volume of studies identified by a literature search. Here, LLMs (GPT-3.5 Turbo, GPT-4 Turbo, GPT-4o, Llama 3 70B, Gemini 1.5 Pro, and Claude Sonnet 3.5) were trialled on systematic reviews in a full issue of the Cochrane Library to evaluate their accuracy in zero-shot binary classification for abstract screening. Trials over a subset of 800 records identified optimal prompting strategies and demonstrated superior performance of LLMs to human researchers in terms of sensitivity (LLM-max = 1.000, human-max = 0.775), precision (LLM-max = 0.927, human-max = 0.911), and balanced accuracy (LLM-max = 0.904, human-max = 0.865). The best performing…
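The three metrics quoted in the abstract follow from the confusion matrix of a binary include/exclude classifier. The sketch below uses the standard definitions; the function name `screening_metrics` and the toy labels are illustrative assumptions and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class ScreeningMetrics:
    sensitivity: float        # TP / (TP + FN): share of relevant records kept
    specificity: float        # TN / (TN + FP): share of irrelevant records dropped
    precision: float          # TP / (TP + FP): share of kept records that are relevant
    balanced_accuracy: float  # mean of sensitivity and specificity


def _ratio(numerator: int, denominator: int) -> float:
    # Return 0.0 when the denominator is empty rather than dividing by zero.
    return numerator / denominator if denominator else 0.0


def screening_metrics(y_true: Sequence[bool], y_pred: Sequence[bool]) -> ScreeningMetrics:
    """Compute screening metrics, where True means 'include the record'."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)
    tn = sum(1 for t, p in pairs if not t and not p)
    fp = sum(1 for t, p in pairs if not t and p)
    fn = sum(1 for t, p in pairs if t and not p)
    sensitivity = _ratio(tp, tp + fn)
    specificity = _ratio(tn, tn + fp)
    return ScreeningMetrics(
        sensitivity=sensitivity,
        specificity=specificity,
        precision=_ratio(tp, tp + fp),
        balanced_accuracy=(sensitivity + specificity) / 2,
    )


# Toy example: 3 relevant and 5 irrelevant records, one false positive.
reference = [True, True, True, False, False, False, False, False]
predicted = [True, True, True, True, False, False, False, False]
metrics = screening_metrics(reference, predicted)
```

Balanced accuracy is useful here because screening datasets are usually dominated by irrelevant records, so plain accuracy would reward a classifier that excludes everything.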

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Text and Document Classification Technologies

Methods: Linear Layer · Layer Normalization · Position-Wise Feed-Forward Layer · Adam · Attention Is All You Need · Multi-Head Attention · Residual Connection · Byte Pair Encoding · Dropout · Absolute Position Encodings