Compact large language models for title and abstract screening in systematic reviews: An assessment of feasibility, accuracy, and workload reduction
Antonio Sciurti, Giuseppe Migliara, Leonardo Maria Siena, Claudia Isonne, Maria Roberta De Blasiis, Alessandra Sinopoli, Jessica Iera, Carolina Marzuillo, Corrado De Vito, Paolo Villari, Valentina Baccolini

TL;DR
This study evaluates compact large language models for automating title and abstract screening in systematic reviews, finding that they can reduce screening workload while maintaining high sensitivity, although precision is low.
Contribution
The study compares three compact LLMs (GPT-4o mini, Llama 3.1 8B, and Gemma 2 9B) for title and abstract screening in systematic reviews, highlighting their feasibility and workload-saving potential.
Findings
LLMs achieved high sensitivity (up to 100%) but low precision (below 10%) in screening records.
Specificity and workload savings improved at higher rating thresholds.
GPT-4o mini was the fastest and most cost-effective model for screening.
Abstract
Systematic reviews play a critical role in evidence-based research but are labor-intensive, especially during title and abstract screening. Compact large language models (LLMs) offer the potential to automate this process while balancing time and cost requirements against accuracy. The aim of this study is to assess the feasibility, accuracy, and workload reduction achieved by three compact LLMs (GPT-4o mini, Llama 3.1 8B, and Gemma 2 9B) in screening titles and abstracts. Records were sourced from three previously published systematic reviews, and the LLMs were asked to rate each record from 0 to 100 for inclusion, using a structured prompt. Predefined rating thresholds of 25, 50, and 75 were used to compute performance metrics (balanced accuracy, sensitivity, specificity, positive and negative predictive value, and workload saving). Processing time and costs were recorded. Across the systematic reviews, LLMs…
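The threshold-based evaluation described in the abstract can be illustrated with a minimal sketch. It is not the authors' code. It assumes that a record is flagged for inclusion when its rating is greater than or equal to the threshold, and that workload saving is the share of records the model excludes (true negatives plus false negatives over all records); the abstract does not state either definition, so both are assumptions. The function name `screening_metrics` and the example ratings are hypothetical.

```python
from typing import Dict, List


def screening_metrics(
    ratings: List[int], labels: List[bool], threshold: int
) -> Dict[str, float]:
    """Compute screening metrics for one rating threshold.

    ratings: model ratings from 0 to 100, one per record.
    labels: True if human reviewers included the record, else False.
    Assumption: a record is flagged when rating >= threshold.
    """
    tp = fp = tn = fn = 0
    for rating, relevant in zip(ratings, labels):
        flagged = rating >= threshold
        if flagged and relevant:
            tp += 1
        elif flagged and not relevant:
            fp += 1
        elif not flagged and relevant:
            fn += 1
        else:
            tn += 1

    def ratio(numerator: int, denominator: int) -> float:
        # Return NaN instead of raising when a denominator is zero.
        return numerator / denominator if denominator else float("nan")

    sensitivity = ratio(tp, tp + fn)
    specificity = ratio(tn, tn + fp)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "balanced_accuracy": (sensitivity + specificity) / 2,
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
        # Assumed definition: fraction of records humans no longer screen.
        "workload_saving": ratio(tn + fn, tp + fp + tn + fn),
    }


if __name__ == "__main__":
    # Hypothetical ratings and reference decisions, for illustration only.
    example_ratings = [90, 80, 60, 40, 30, 20, 10, 5]
    example_labels = [True, False, True, False, False, False, False, False]
    for cutoff in (25, 50, 75):
        print(cutoff, screening_metrics(example_ratings, example_labels, cutoff))
```

In this toy data, raising the threshold from 50 to 75 excludes more records and so raises the assumed workload saving, but it also misses one relevant record, which is the trade-off between sensitivity and workload that the thresholds are meant to expose.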
Taxonomy
Topics: Meta-analysis and systematic reviews · Health Policy Implementation Science · Mental Health via Writing
