# Compact large language models for title and abstract screening in systematic reviews: An assessment of feasibility, accuracy, and workload reduction

**Authors:** Antonio Sciurti, Giuseppe Migliara, Leonardo Maria Siena, Claudia Isonne, Maria Roberta De Blasiis, Alessandra Sinopoli, Jessica Iera, Carolina Marzuillo, Corrado De Vito, Paolo Villari, Valentina Baccolini

PMC · DOI: 10.1017/rsm.2025.10044 · 2025-11-13

## TL;DR

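This study evaluates three compact large language models (GPT-4o mini, Llama 3.1 8B, and Gemma 2 9B) for automating title and abstract screening in systematic reviews. The models reached high sensitivity and substantial workload savings at reasonable cost and processing time, but their precision was low.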

## Contribution

The study compares three compact LLMs on title and abstract screening, using records from three previously published systematic reviews, and assesses their feasibility, accuracy, processing time, cost, and workload-saving potential.

## Key findings

- LLMs achieved high sensitivity (up to 100%) but low precision (below 10%) for records included by full text.
- Specificity and workload savings improved at higher rating thresholds, with the 50- and 75-rating thresholds offering the best trade-offs.
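- GPT-4o mini, accessed via API, was the fastest model (about 40 minutes at most) but incurred usage costs ($0.14–$1.93 per review); Llama 3.1 8B and Gemma 2 9B ran locally for free but took longer (about 4 hours at most).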

## Abstract

Systematic reviews play a critical role in evidence-based research but are labor-intensive, especially during title and abstract screening. Compact large language models (LLMs) offer potential to automate this process, balancing time/cost requirements and accuracy. The aim of this study is to assess the feasibility, accuracy, and workload reduction by three compact LLMs (GPT-4o mini, Llama 3.1 8B, and Gemma 2 9B) in screening titles and abstracts. Records were sourced from three previously published systematic reviews and LLMs were requested to rate each record from 0 to 100 for inclusion, using a structured prompt. Predefined 25-, 50-, 75-rating thresholds were used to compute performance metrics (balanced accuracy, sensitivity, specificity, positive and negative predictive value, and workload-saving). Processing time and costs were registered. Across the systematic reviews, LLMs achieved high sensitivity (up to 100%) and low precision (below 10%) for records included by full text. Specificity and workload savings improved at higher thresholds, with the 50- and 75-rating thresholds offering optimal trade-offs. GPT-4o-mini, accessed via application programming interface, was the fastest model (~40 minutes max.) and had usage costs ($0.14–$1.93 per review). Llama 3.1-8B and Gemma 2-9B were run locally in longer times (~4 hours max.) and were free to use. LLMs were highly sensitive tools for the title/abstract screening process. High specificity values were reached, allowing for significant workload savings, at reasonable costs and processing time. Conversely, we found them to be imprecise. However, high sensitivity and workload reduction are key factors for their usage in the title/abstract screening phase of systematic reviews.
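The threshold-based evaluation described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' code: the function name `screening_metrics`, the toy ratings, the rule that a rating at or above the threshold counts as "include", and the definition of workload saving as the share of records the model excludes are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class ScreeningMetrics:
    sensitivity: float
    specificity: float
    balanced_accuracy: float
    ppv: float  # positive predictive value (precision)
    npv: float  # negative predictive value
    workload_saving: float


def _ratio(numerator: int, denominator: int) -> float:
    """Return numerator/denominator, or NaN when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")


def screening_metrics(ratings, included, threshold) -> ScreeningMetrics:
    """Compare thresholded 0-100 ratings with reference inclusion decisions.

    Assumption: a record is predicted 'include' when rating >= threshold.
    Assumption: workload saving = fraction of records the model excludes,
    i.e. records a human reviewer would no longer need to screen.
    """
    if len(ratings) != len(included):
        raise ValueError("ratings and included must have the same length")
    tp = fp = tn = fn = 0
    for rating, truth in zip(ratings, included):
        predicted = rating >= threshold
        if predicted and truth:
            tp += 1
        elif predicted and not truth:
            fp += 1
        elif not predicted and truth:
            fn += 1
        else:
            tn += 1
    sensitivity = _ratio(tp, tp + fn)
    specificity = _ratio(tn, tn + fp)
    return ScreeningMetrics(
        sensitivity=sensitivity,
        specificity=specificity,
        balanced_accuracy=(sensitivity + specificity) / 2,
        ppv=_ratio(tp, tp + fp),
        npv=_ratio(tn, tn + fn),
        workload_saving=_ratio(tn + fn, len(ratings)),
    )


# Toy data (hypothetical): ten records, two of which were truly included.
ratings = [90, 80, 60, 40, 30, 20, 10, 5, 55, 70]
included = [True, False, False, False, False, False, False, False, False, True]

for threshold in (25, 50, 75):
    print(threshold, screening_metrics(ratings, included, threshold))
```

On this toy data, raising the threshold from 25 to 75 increases specificity and workload saving, while sensitivity drops once a truly included record falls below the cut-off. This mirrors the trade-off reported in the abstract, although the numbers here carry no empirical meaning.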

## Full-text entities

- **Diseases:** VL (MESH:C536141), COVID (MESH:D000086382)
- **Species:** Acinetobacter baumannii (species) [taxon 470], Homo sapiens (human, species) [taxon 9606]

## Figures

6 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12873614/full.md

---
Source: https://tomesphere.com/paper/PMC12873614