LLM4SCREENLIT: Recommendations on Assessing the Performance of Large Language Models for Screening Literature in Systematic Reviews

Lech Madeyski; Barbara Kitchenham; Martin Shepperd

arXiv:2511.12635·cs.SE·April 28, 2026


TL;DR

This paper introduces WMCC, a new metric for evaluating large language models in literature screening, emphasizing cost-sensitive assessment and comprehensive reporting practices.

Contribution

It proposes WMCC, a cost-sensitive evaluation metric, and provides practical guidelines for assessing LLM performance in systematic review screening.

Findings

1. WMCC often disagrees with accuracy and MCC in ranking LLMs.

2. Most studies lack full confusion matrix reporting and cost-sensitive metrics.

3. Using WMCC reduces false negatives in literature screening evaluations.

Abstract

Context: Large language models (LLMs) are increasingly used to screen literature for systematic reviews (SRs), but the standard confusion-matrix metrics used to evaluate them can mislead under the imbalanced, cost-asymmetric conditions of screening. Objective: We develop and justify LLM4SCREENLIT: practical recommendations for researchers conducting LLM-screening evaluations and for editors and reviewers assessing such studies, differentiated by study type (retrospective benchmarking vs. deployment for a specific SR). Method: Using Delgado-Chaves et al. (2025), an 18-LLM benchmark across three biomedical SRs, as a motivating example, we reviewed 28 additional papers and extracted their reported metrics. We propose a Weighted Matthews Correlation Coefficient (WMCC) that integrates MCC's chance-correction with asymmetric misclassification costs, and validated it on three…
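To illustrate why standard confusion-matrix metrics can mislead under class imbalance, the sketch below compares accuracy, recall, and the ordinary (unweighted) MCC for two screeners. The confusion-matrix counts are hypothetical and are not taken from the paper or from Delgado-Chaves et al. (2025). The WMCC formula is not given in this summary, so it is deliberately not implemented here. Returning 0 when the MCC denominator is zero is a common convention, assumed for this example.

```python
import math


def accuracy(tp, fp, fn, tn):
    """Fraction of all records classified correctly."""
    return (tp + tn) / (tp + fp + fn + tn)


def recall(tp, fn):
    """Fraction of relevant records that the screener included."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def mcc(tp, fp, fn, tn):
    """Standard Matthews Correlation Coefficient.

    Returns 0.0 when any marginal total is zero (assumed convention).
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


# Hypothetical corpus: 1000 records, of which 50 are relevant.
# Screener A excludes every record, so it misses all 50 relevant ones.
exclude_all = dict(tp=0, fp=0, fn=50, tn=950)
# Screener B finds 45 of the 50 relevant records but wrongly includes 100.
useful = dict(tp=45, fp=100, fn=5, tn=850)

for name, cm in [("exclude_all", exclude_all), ("useful", useful)]:
    print(
        f"{name:12s} accuracy={accuracy(**cm):.3f} "
        f"recall={recall(cm['tp'], cm['fn']):.3f} "
        f"mcc={mcc(**cm):.3f}"
    )
```

With these illustrative counts, the screener that excludes everything scores higher on accuracy (0.950 vs. 0.895) even though it finds no relevant records, whereas MCC ranks the two the other way round. MCC still treats a false negative and a false positive symmetrically, which is the gap a cost-sensitive metric is meant to address.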
