SESR-Eval: Dataset for Evaluating LLMs in the Title-Abstract Screening of Systematic Reviews

Aleksi Huotala; Miikka Kuutila; Mika Mäntylä

arXiv:2507.19027·cs.SE·December 25, 2025

Open Access

TL;DR

This paper introduces SESR-Eval, a benchmark dataset with over 34,000 labeled studies, to evaluate LLMs in software engineering systematic review screening, revealing current limitations in accuracy and cost-effectiveness.

Contribution

The paper provides the SESR-Eval dataset and benchmarking results, enabling assessment of LLM performance in SR screening within software engineering.

Findings

01. Most LLMs perform similarly in screening accuracy.

02. Differences in screening accuracy between secondary studies exceed the differences between LLMs.

03. Using an LLM costs less than $40 per secondary study.
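
As a rough illustration of how a per-study screening cost like the one in the third finding might be estimated, the sketch below multiplies token counts by per-token prices. The function name, the token counts, and the prices are hypothetical placeholders, not figures from the paper.

```python
def estimate_screening_cost(
    n_studies: int,
    tokens_in_per_study: int,
    tokens_out_per_study: int,
    usd_per_million_in: float,
    usd_per_million_out: float,
) -> float:
    """Estimate the API cost (USD) of screening one secondary study's candidates.

    All arguments are assumptions supplied by the caller; none are taken
    from the SESR-Eval paper.
    """
    # Cost of the prompts (title + abstract + screening criteria).
    input_cost = n_studies * tokens_in_per_study * usd_per_million_in / 1_000_000
    # Cost of the model's answers (e.g. a short include/exclude decision).
    output_cost = n_studies * tokens_out_per_study * usd_per_million_out / 1_000_000
    return input_cost + output_cost


# Hypothetical example: 1,000 candidate studies, 500 input and 10 output
# tokens each, at placeholder prices of $2 and $8 per million tokens.
example_cost = estimate_screening_cost(1000, 500, 10, 2.0, 8.0)
print(f"Estimated cost: ${example_cost:.2f}")
```

Actual costs depend on the model's pricing, the length of the screening criteria, and how many primary studies each secondary study contains.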

Abstract

Background: The use of large language models (LLMs) in the title-abstract screening process of systematic reviews (SRs) has shown promising results, but suffers from limited performance evaluation. Aims: Create a benchmark dataset to evaluate the performance of LLMs in the title-abstract screening process of SRs, and provide evidence on whether using LLMs in title-abstract screening in software engineering is advisable. Method: We start with 169 SR research artifacts and find 24 of them suitable for inclusion in the dataset. Using the dataset, we benchmark title-abstract screening with 9 LLMs. Results: We present the SESR-Eval (Software Engineering Systematic Review Evaluation) dataset containing 34,528 labeled primary studies, sourced from 24 secondary studies published in software engineering (SE) journals. Most LLMs performed similarly and the differences in screening accuracy…
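
A minimal sketch of how LLM screening decisions could be scored against human labels, grouped by secondary study. The record layout `(review_id, human_label, llm_label)` and the choice of accuracy, recall, and precision are assumptions made for illustration; the paper's own metrics and data format may differ.

```python
from collections import defaultdict


def screening_metrics(records):
    """Compute per-review screening metrics.

    records: iterable of (review_id, human_label, llm_label) tuples, where
    both labels are booleans (True = include). This layout is hypothetical.
    """
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0, "tn": 0})
    for review_id, human, llm in records:
        if human and llm:
            key = "tp"  # correctly included
        elif llm:
            key = "fp"  # included by the LLM, excluded by humans
        elif human:
            key = "fn"  # relevant study missed by the LLM
        else:
            key = "tn"  # correctly excluded
        counts[review_id][key] += 1

    results = {}
    for review_id, c in counts.items():
        total = sum(c.values())
        relevant = c["tp"] + c["fn"]
        predicted = c["tp"] + c["fp"]
        results[review_id] = {
            "accuracy": (c["tp"] + c["tn"]) / total,
            # None when the metric is undefined (no relevant or no predicted includes).
            "recall": c["tp"] / relevant if relevant else None,
            "precision": c["tp"] / predicted if predicted else None,
        }
    return results


# Toy data for two hypothetical secondary studies, "A" and "B".
toy_records = [
    ("A", True, True), ("A", False, True), ("A", False, False), ("A", True, False),
    ("B", True, True), ("B", False, False),
]
metrics = screening_metrics(toy_records)
```

Reporting metrics per secondary study, rather than pooled, makes it possible to compare variation across studies with variation across models.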

Taxonomy

Topics: Meta-analysis and systematic reviews · Artificial Intelligence in Healthcare and Education · Scientific Computing and Data Management