Beyond Accuracy: LLM Variability in Evidence Screening for Software Engineering SLRs
Gilberto Sussumu Hida, Danilo Monteiro Ribeiro, Erika Yahata

TL;DR
This study evaluates the variability and effectiveness of Large Language Models (LLMs) in screening studies for systematic literature reviews, compares them with classical classifiers, and analyzes the impact of input features.
Contribution
It provides a comprehensive assessment of LLM performance variability, quantifies the influence of input metadata, and compares LLMs with traditional models in the context of evidence screening.
Findings
LLMs show high heterogeneity and residual non-determinism.
Removing abstracts degrades LLM performance significantly.
Classical models' performance is often comparable to LLMs, with no consistent superiority.
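To make "residual non-determinism" concrete, the following is a minimal sketch of one way run-to-run variability and recall could be quantified when the same papers are screened several times. The function names, the flip-rate metric, and the toy data are illustrative assumptions and are not taken from the paper, which may use different measures.

```python
def decision_flip_rate(runs):
    """Fraction of papers whose include/exclude decision differs across runs.

    `runs` is a list of repeated screening runs; each run is a list of
    booleans (True = include), one entry per paper, in the same order.
    This metric is an illustrative assumption, not the paper's definition.
    """
    n_papers = len(runs[0])
    flipped = sum(
        1 for i in range(n_papers) if len({run[i] for run in runs}) > 1
    )
    return flipped / n_papers


def recall(predicted, truth):
    """Share of truly relevant papers that were included.

    Recall matters here because a false negative drops a relevant study.
    """
    tp = sum(1 for p, t in zip(predicted, truth) if p and t)
    fn = sum(1 for p, t in zip(predicted, truth) if not p and t)
    return tp / (tp + fn) if (tp + fn) else 0.0


# Hypothetical toy data: 5 papers, 3 repeated runs of the same model.
truth = [True, True, False, False, True]
runs = [
    [True, True, False, False, True],
    [True, False, False, False, True],
    [True, True, False, True, True],
]

flip = decision_flip_rate(runs)             # papers at index 1 and 3 change
recalls = [recall(r, truth) for r in runs]  # recall per run
print(f"flip rate: {flip:.2f}, recall range: {min(recalls):.2f}-{max(recalls):.2f}")
```

Reporting the spread of recall across runs, rather than a single value, is one way to surface the variability that a single-run accuracy figure would hide.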
Abstract
Context: Study screening in systematic literature reviews is costly, inconsistency-prone, and risk-asymmetric, since false negatives can compromise validity. Despite rapid uptake of Large Language Models (LLMs), there is limited evidence on how such models behave during the study screening phase, particularly regarding the choice of specific LLMs and their comparison with classical models. Objective: To assess LLM performance and variability in screening, quantify the impact of input metadata (abstract, title, keywords), and compare LLMs with classical classifiers under a shared protocol. Methods: We analyzed 12 LLMs from 4 providers (OpenAI, Google Gemini, Anthropic, Llama) and 4 classical models (Logistic Regression, Support Vector Classification, Random Forest, and Naive Bayes) on 2 real Systematic Literature Reviews (SLRs), totaling 518 papers. The experimental design investigated 3…
