LLM-Assisted Abstract Screening with OLIVER: Evaluating Calibration and Single-Model vs. Actor-Critic Configurations in Literature Reviews

Kian Godhwani; David Benrimoh

arXiv:2512.20022 · cs.IR · December 24, 2025

Open Access

TL;DR

This paper introduces OLIVER, an open-source LLM-assisted screening pipeline, evaluates multiple models and configurations across systematic reviews, and finds that actor-critic frameworks enhance screening accuracy and calibration over single models.

Contribution

The study presents OLIVER, compares LLM configurations, and demonstrates that actor-critic models improve screening performance and calibration in literature reviews.

Findings

01. Model performance varies widely across reviews.

02. Actor-critic framework improves discrimination and calibration.

03. Calibration remains weak in single-model setups.
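
The findings above refer to discrimination and calibration. As a rough illustration of what such metrics measure, the sketch below computes AUC, the Brier score, and expected calibration error (ECE) from predicted inclusion probabilities and human include/exclude labels. The choice of Brier score and ECE, the function names, and the toy data are assumptions made for illustration; this page does not state which calibration metrics the paper used.

```python
from typing import Sequence


def auc(probs: Sequence[float], labels: Sequence[int]) -> float:
    """Probability that a random included abstract outscores a random excluded one."""
    pos = [p for p, y in zip(probs, labels) if y == 1]
    neg = [p for p, y in zip(probs, labels) if y == 0]
    # Count pairwise wins; ties count as half a win.
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)
    return wins / (len(pos) * len(neg))


def brier_score(probs: Sequence[float], labels: Sequence[int]) -> float:
    """Mean squared gap between predicted probability and the 0/1 outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)


def expected_calibration_error(
    probs: Sequence[float], labels: Sequence[int], n_bins: int = 10
) -> float:
    """Weighted average of |mean confidence - observed inclusion rate| per bin."""
    bins: list[list[tuple[float, int]]] = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # clamp p == 1.0 into the last bin
        bins[idx].append((p, y))
    total = len(probs)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(p for p, _ in members) / len(members)
        frac_included = sum(y for _, y in members) / len(members)
        ece += (len(members) / total) * abs(mean_conf - frac_included)
    return ece


# Toy data (hypothetical): four abstracts, label 1 = included by human reviewers.
toy_probs = [0.95, 0.85, 0.35, 0.15]
toy_labels = [1, 0, 1, 0]
print(auc(toy_probs, toy_labels))
print(brier_score(toy_probs, toy_labels))
print(expected_calibration_error(toy_probs, toy_labels))
```

A model can rank abstracts well (high AUC) while still reporting probabilities that do not match observed inclusion rates (high ECE), which is why the two are reported separately.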

Abstract

Introduction: Recent work suggests large language models (LLMs) can accelerate screening, but prior evaluations focus on earlier LLMs, standardized Cochrane reviews, single-model setups, and accuracy as the primary metric, leaving generalizability, configuration effects, and calibration largely unexamined. Methods: We developed OLIVER (Optimized LLM-based Inclusion and Vetting Engine for Reviews), an open-source pipeline for LLM-assisted abstract screening. We evaluated multiple contemporary LLMs across two non-Cochrane systematic reviews, assessing performance at both the full-text screening and final inclusion stages using accuracy, AUC, and calibration metrics. We further tested an actor-critic screening framework combining two lightweight models under three aggregation rules. Results: Across individual models, performance varied widely. In the smaller Review 1 (821…
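
The abstract mentions combining two lightweight models under three aggregation rules but, as shown here, does not name those rules. The sketch below is therefore only a guess at what such a setup could look like: the rule names `either`, `both`, and `mean`, the `Judgement` type, and the 0.5 threshold are all hypothetical and are not taken from OLIVER.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Judgement:
    """One model's output for one abstract (hypothetical structure)."""
    include_prob: float  # stated probability that the abstract should be included


def aggregate(actor: Judgement, critic: Judgement, rule: str, threshold: float = 0.5) -> bool:
    """Combine actor and critic judgements into a single include/exclude decision.

    The three rules are illustrative assumptions, not the paper's rules:
      either: include if at least one model votes to include (favors recall)
      both:   include only if both models vote to include (favors precision)
      mean:   include if the average probability reaches the threshold
    """
    actor_vote = actor.include_prob >= threshold
    critic_vote = critic.include_prob >= threshold
    if rule == "either":
        return actor_vote or critic_vote
    if rule == "both":
        return actor_vote and critic_vote
    if rule == "mean":
        return (actor.include_prob + critic.include_prob) / 2 >= threshold
    raise ValueError(f"unknown aggregation rule: {rule!r}")


# Hypothetical case where the two models disagree about one abstract.
actor = Judgement(include_prob=0.8)
critic = Judgement(include_prob=0.3)
for rule in ("either", "both", "mean"):
    print(rule, aggregate(actor, critic, rule))
```

In screening, wrongly excluding a relevant study is usually costlier than passing an irrelevant one to the next stage, so a permissive rule such as `either` would be a natural candidate for a first pass under these assumptions.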

Taxonomy

Topics: Artificial Intelligence in Healthcare and Education · Meta-analysis and systematic reviews · Topic Modeling