Compact large language models for title and abstract screening in systematic reviews: An assessment of feasibility, accuracy, and workload reduction

Antonio Sciurti; Giuseppe Migliara; Leonardo Maria Siena; Claudia Isonne; Maria Roberta De Blasiis; Alessandra Sinopoli; Jessica Iera; Carolina Marzuillo; Corrado De Vito; Paolo Villari; Valentina Baccolini

PMC · DOI: 10.1017/rsm.2025.10044 · November 13, 2025

Open Access

TL;DR

This study evaluates three compact large language models for automating title and abstract screening in systematic reviews, finding that they can reduce screening workload while maintaining high sensitivity, although precision is low.

Contribution

The study introduces and compares compact LLMs for systematic review screening, highlighting their feasibility and workload-saving potential.

Findings

01. LLMs achieved high sensitivity (up to 100%) but low precision (below 10%) in screening records.

02. Specificity and workload savings improved at higher rating thresholds.

03. GPT-4o mini was the fastest and most cost-effective model for screening.

Abstract

Systematic reviews play a critical role in evidence-based research but are labor-intensive, especially during title and abstract screening. Compact large language models (LLMs) could automate this step while balancing time and cost requirements against accuracy. This study assesses the feasibility, accuracy, and workload reduction of three compact LLMs (GPT-4o mini, Llama 3.1 8B, and Gemma 2 9B) in screening titles and abstracts. Records were sourced from three previously published systematic reviews, and each LLM was asked to rate every record from 0 to 100 for inclusion using a structured prompt. Predefined rating thresholds of 25, 50, and 75 were used to compute performance metrics (balanced accuracy, sensitivity, specificity, positive and negative predictive value, and workload saving). Processing time and costs were recorded. Across the systematic reviews, LLMs…
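To make the threshold-based evaluation concrete, the following is a minimal sketch of how the listed metrics could be computed from 0 to 100 ratings. It is not the authors' code. Several points are assumptions: a record counts as "included by the model" when its rating is greater than or equal to the threshold, workload saving is taken as the share of records falling below the threshold (records a human would no longer screen), and the function name, class name, and toy data are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ScreeningMetrics:
    threshold: int
    sensitivity: float
    specificity: float
    balanced_accuracy: float
    ppv: float
    npv: float
    workload_saving: float


def _ratio(numerator, denominator):
    # Return NaN rather than raising when a denominator is zero.
    return numerator / denominator if denominator else float("nan")


def screening_metrics(ratings, labels, threshold):
    """Compute screening metrics at one rating threshold.

    ratings: model ratings in [0, 100], one per record.
    labels:  True if human reviewers included the record.
    Assumption: rating >= threshold means the model includes the record.
    """
    tp = fp = tn = fn = 0
    for rating, included in zip(ratings, labels):
        predicted = rating >= threshold
        if predicted and included:
            tp += 1
        elif predicted and not included:
            fp += 1
        elif not predicted and included:
            fn += 1
        else:
            tn += 1

    sensitivity = _ratio(tp, tp + fn)
    specificity = _ratio(tn, tn + fp)
    return ScreeningMetrics(
        threshold=threshold,
        sensitivity=sensitivity,
        specificity=specificity,
        balanced_accuracy=(sensitivity + specificity) / 2,
        ppv=_ratio(tp, tp + fp),
        npv=_ratio(tn, tn + fn),
        # Assumed definition: fraction of records excluded by the model.
        workload_saving=_ratio(tn + fn, tp + fp + tn + fn),
    )


# Hypothetical toy data, not taken from the study.
ratings = [90, 80, 60, 40, 30, 20, 10, 5]
labels = [True, True, False, True, False, False, False, False]

for threshold in (25, 50, 75):
    print(screening_metrics(ratings, labels, threshold))
```

On this toy data, raising the threshold from 25 to 75 increases specificity and the assumed workload saving while sensitivity drops, which mirrors the trade-off the abstract evaluates across its three thresholds.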

Figures: 6

Taxonomy

Topics: Meta-analysis and systematic reviews · Health Policy Implementation Science · Mental Health via Writing