Can Large Language Models Match the Conclusions of Systematic Reviews?

Christopher Polzak; Alejandro Lozano; Min Woo Sun; James Burgess; Yuhui Zhang; Kevin Wu; and Serena Yeung-Levy

arXiv:2505.22787·cs.CL·May 30, 2025

TL;DR

This study evaluates whether large language models can replicate the conclusions of systematic reviews in medicine, revealing current limitations in reasoning, skepticism, and performance consistency across models and sizes.

Contribution

The paper introduces MedEvidence, a benchmark dataset for assessing LLMs on systematic review tasks, and provides a comprehensive evaluation of 24 models highlighting their current shortcomings.

Findings

01

Performance degrades with longer token inputs

02

Larger models do not always perform better

03

Knowledge fine-tuning reduces accuracy

Abstract

Systematic reviews (SRs), in which experts summarize and analyze evidence across individual studies to provide insights on a specialized topic, are a cornerstone for evidence-based clinical decision-making, research, and policy. Given the exponential growth of scientific articles, there is growing interest in using large language models (LLMs) to automate SR generation. However, the ability of LLMs to critically assess evidence and reason across multiple documents to provide recommendations at the same proficiency as domain experts remains poorly characterized. We therefore ask: Can LLMs match the conclusions of systematic reviews written by clinical experts when given access to the same studies? To explore this question, we present MedEvidence, a benchmark pairing findings from 100 SRs with the studies they are based on. We benchmark 24 LLMs on MedEvidence, including reasoning,…

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 4 · Confidence 4

Strengths

S1) This is a well-posed task and this paper makes targeted contributions. The paper removes retrieval and long-summary grading by converting conclusions into closed-QA and evaluating exact-match answers.
S2) Evaluation is reasonably transparent with limited uncertainty. Metrics (e.g. per-class recall, accuracy, evidence uncertainty w/ source concordance) reliably support key findings. Work is methodologically sound.
S3) Clear empirical takeaways -- I am largely satisfied with their takeaways of

Weaknesses

W1) I think perhaps a reasonable weakness/question here is regarding conceptual novelty and whether this paper is suited for ICLR. Essentially the authors do a dataset+evaluation work with closed-class answers. While useful, it advances prior factuality/evidence-reasoning datasets incrementally (i.e. Table 1) and centers on mapping SR conclusions to a QA rather than introducing new modeling/eval methods.
W2) LLM-derived source concordance. While I think this is a reasonable thing to do for eval/

Reviewer 02 · Rating 4 · Confidence 5

Strengths

This is a well-written paper with a scoped contribution that explicitly tests the ability of modern LLMs (incl. reasoning models) to synthesize evidence across systematic reviews; the analysis is solid and has high relevance for clinical settings where these models may be deployed.

Weaknesses

My main concern with this work is the lack of adequate comparison with prior work. It’s unclear whether the dataset itself is a novel contribution; there is a lack of interaction with prior work on evaluating LLM evidence synthesis and existing cleaned SR datasets: [1] Three datasets built similarly: TrialReviewBench (https://arxiv.org/html/2407.00631v2), https://arxiv.org/abs/2008.11293, and https://aclanthology.org/2024.acl-srw.42/. [2] There is some prior work that finds similar conclusi

Reviewer 03 · Rating 6 · Confidence 3

Strengths

* All test cases are manually curated based on existing Cochrane meta-analyses.
* A large number of LMs are evaluated on the benchmark.

Weaknesses

* The soundness of the benchmark is heavily based on the annotation quality, but there is no discussion about the annotators’ background.
* The only task provided by MedEvidence so far is a multiple-choice/classification task, as the model has to answer one of five given treatment outcome effects. Please see the questions for details.
* There are no numerical experimental statistics in the main paper. All results are presented in plots.
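
The reviews describe the evaluation as closed QA over five outcome-effect classes, scored by exact match with accuracy and per-class recall. A minimal sketch of that kind of scoring follows. It is not taken from the paper's repository: the label strings in `LABELS`, the `normalize` helper, and the `score` function are all hypothetical names chosen for illustration.

```python
from collections import Counter

# Hypothetical label set: the reviews only state that there are five
# treatment outcome-effect classes, not what they are called.
LABELS = ["higher", "lower", "no difference", "uncertain effect", "insufficient data"]


def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace so exact match ignores formatting."""
    return " ".join(answer.strip().lower().split())


def score(gold, predicted):
    """Return (accuracy, per-class recall) under exact-match scoring."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    total = Counter()    # number of gold items per class
    correct = Counter()  # number of exactly matched items per class
    for g, p in zip(gold, predicted):
        g_norm = normalize(g)
        total[g_norm] += 1
        if normalize(p) == g_norm:
            correct[g_norm] += 1
    accuracy = sum(correct.values()) / len(gold) if gold else 0.0
    # Recall is reported only for classes that occur in the gold answers.
    recall = {label: correct[label] / total[label] for label in LABELS if total[label]}
    return accuracy, recall


if __name__ == "__main__":
    gold = ["higher", "higher", "lower", "no difference"]
    predicted = ["Higher", "lower", "lower", "uncertain effect"]
    accuracy, recall = score(gold, predicted)
    print(accuracy)
    print(recall)
```

Reporting recall per class alongside overall accuracy matters when the classes are imbalanced, since a model that rarely predicts a minority class can still score well on accuracy alone.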

Code & Models

Repositories

zy-f/med-evidence · Official

Taxonomy

Topics: Meta-analysis and systematic reviews · Artificial Intelligence in Healthcare and Education · Topic Modeling