LLM4SCREENLIT: Recommendations on Assessing the Performance of Large Language Models for Screening Literature in Systematic Reviews
Lech Madeyski, Barbara Kitchenham, Martin Shepperd

TL;DR
This paper introduces the Weighted Matthews Correlation Coefficient (WMCC), a new metric for evaluating large language models in literature screening, and emphasizes cost-sensitive assessment and comprehensive reporting practices.
Contribution
It proposes WMCC, a cost-sensitive evaluation metric, and provides practical guidelines for assessing LLM performance in systematic review screening.
Findings
WMCC often disagrees with accuracy and MCC in ranking LLMs.
Most studies lack full confusion matrix reporting and cost-sensitive metrics.
Using WMCC reduces false negatives in literature screening evaluations.
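To illustrate why metric choice matters when screening data are imbalanced, the sketch below computes accuracy and the standard MCC from full confusion matrices for two screeners. The confusion-matrix counts are invented for illustration and are not taken from the paper, and the sketch does not implement WMCC, whose definition is not given in this summary.

```python
import math

def accuracy(tp, fp, fn, tn):
    """Fraction of all records classified correctly."""
    return (tp + tn) / (tp + fp + fn + tn)

def recall(tp, fp, fn, tn):
    """Fraction of truly relevant records that were included."""
    return tp / (tp + fn) if (tp + fn) else 0.0

def mcc(tp, fp, fn, tn):
    """Standard Matthews Correlation Coefficient; 0.0 when undefined."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom

# Hypothetical screening task: 1000 records, 50 of them relevant.
# Screener A excludes almost everything; screener B finds most relevant
# records at the price of more false positives.
screener_a = dict(tp=5, fp=0, fn=45, tn=950)
screener_b = dict(tp=45, fp=100, fn=5, tn=850)

for name, cm in (("A", screener_a), ("B", screener_b)):
    print(name,
          "accuracy=%.3f" % accuracy(**cm),
          "recall=%.3f" % recall(**cm),
          "mcc=%.3f" % mcc(**cm))
```

With these made-up counts, accuracy prefers screener A even though it misses 45 of the 50 relevant records, while MCC prefers screener B. Reporting the full confusion matrix lets readers recompute any metric they need.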
Abstract
Context: Large language models (LLMs) are increasingly used to screen literature for systematic reviews (SRs), but the standard confusion-matrix metrics used to evaluate them can mislead under the imbalanced, cost-asymmetric conditions of screening. Objective: We develop and justify LLM4SCREENLIT, a set of practical recommendations for researchers conducting LLM-screening evaluations and for editors and reviewers assessing such studies, differentiated by study type (retrospective benchmarking vs. deployment for a specific SR). Method: Using Delgado-Chaves et al. (2025), an 18-LLM benchmark across three biomedical SRs, as a motivating example, we reviewed 28 additional papers and extracted their reported metrics. We propose a Weighted Matthews Correlation Coefficient (WMCC) that integrates MCC's chance-correction with asymmetric misclassification costs, and validated it on three…
