SESR-Eval: Dataset for Evaluating LLMs in the Title-Abstract Screening of Systematic Reviews
Aleksi Huotala, Miikka Kuutila, Mika Mäntylä

TL;DR
This paper introduces SESR-Eval, a benchmark dataset with over 34,000 labeled studies, to evaluate LLMs in software engineering systematic review screening, revealing current limitations in accuracy and cost-effectiveness.
Contribution
The paper provides the SESR-Eval dataset and benchmarking results, enabling assessment of LLM performance in SR screening within software engineering.
Findings
Most LLMs perform similarly in screening accuracy.
Differences between secondary studies exceed differences between LLMs.
Using an LLM costs less than $40 per secondary study.
Abstract
Background: The use of large language models (LLMs) in the title-abstract screening process of systematic reviews (SRs) has shown promising results, but suffers from limited performance evaluation. Aims: Create a benchmark dataset to evaluate the performance of LLMs in the title-abstract screening process of SRs. Provide evidence on whether using LLMs in title-abstract screening in software engineering is advisable. Method: We start with 169 SR research artifacts and find 24 of those to be suitable for inclusion in the dataset. Using the dataset, we benchmark title-abstract screening using 9 LLMs. Results: We present the SESR-Eval (Software Engineering Systematic Review Evaluation) dataset containing 34,528 labeled primary studies, sourced from 24 secondary studies published in software engineering (SE) journals. Most LLMs performed similarly and the differences in screening accuracy…
Taxonomy
Topics: Meta-analysis and systematic reviews · Artificial Intelligence in Healthcare and Education · Scientific Computing and Data Management
