Can Large Language Models Match the Conclusions of Systematic Reviews?
Christopher Polzak, Alejandro Lozano, Min Woo Sun, James Burgess, Yuhui Zhang, Kevin Wu, and Serena Yeung-Levy

TL;DR
This study evaluates whether large language models can replicate the conclusions of systematic reviews in medicine, revealing current limitations in reasoning, skepticism, and performance consistency across models and sizes.
Contribution
The paper introduces MedEvidence, a benchmark dataset for assessing LLMs on systematic review tasks, and provides a comprehensive evaluation of 24 models highlighting their current shortcomings.
Findings
Performance degrades with longer token inputs
Larger models do not always perform better
Knowledge-based fine-tuning reduces accuracy
Abstract
Systematic reviews (SR), in which experts summarize and analyze evidence across individual studies to provide insights on a specialized topic, are a cornerstone for evidence-based clinical decision-making, research, and policy. Given the exponential growth of scientific articles, there is growing interest in using large language models (LLMs) to automate SR generation. However, the ability of LLMs to critically assess evidence and reason across multiple documents to provide recommendations at the same proficiency as domain experts remains poorly characterized. We therefore ask: Can LLMs match the conclusions of systematic reviews written by clinical experts when given access to the same studies? To explore this question, we present MedEvidence, a benchmark pairing findings from 100 SRs with the studies they are based on. We benchmark 24 LLMs on MedEvidence, including reasoning,…
Peer Reviews
Decision: ICLR 2026 Poster
S1) This is a well-posed task and this paper makes targeted contributions. The paper removes retrieval and long-summary grading by converting conclusions into closed-QA and evaluating exact match answers. S2) Evaluation is reasonably transparent with limited uncertainty. Metrics (e.g. per-class recall, accuracy, evidence uncertainty w/ source concordance) reliably support key findings. Work is methodically sound. S3) Clear empirical takeaways -- I am largely satisfied with their takeaways of
W1) I think a reasonable weakness/question here concerns conceptual novelty and whether this paper is suited for ICLR. Essentially, the authors do dataset+evaluation work with closed-class answers. While useful, it advances prior factuality/evidence-reasoning datasets incrementally (i.e. Table 1) and centers on mapping SR conclusions to a QA rather than introducing new modeling/eval methods. W2) LLM-derived source concordance. While I think this is a reasonable thing to do for eval/
This is a well written paper with a scoped contribution that explicitly tests modern LLMs (incl. reasoning models) ability to synthesize evidence across systematic reviews; the analysis is solid and has high relevance for clinical settings where these models may be deployed.
My main concern with this work is the lack of adequate comparison with prior work. It’s unclear whether the dataset itself is a novel contribution; there is a lack of interaction with prior work on evaluating LLM evidence synthesis and existing cleaned SR datasets: [1] Three datasets built similarly: TrialReviewBench (https://arxiv.org/html/2407.00631v2), https://arxiv.org/abs/2008.11293, and https://aclanthology.org/2024.acl-srw.42/. [2] There is some prior work that finds similar conclusi
* All test cases are manually curated based on existing Cochrane meta-analyses. * A large number of LMs are evaluated on the benchmark.
* The soundness of the benchmark is heavily based on the annotation quality, but there is no discussion about the annotators’ background. * The only task provided by MedEvidence so far is a multiple-choice/classification task, as the model has to answer one of five given treatment outcome effects. Please see the questions for details. * There are no numerical experimental statistics in the main paper. All results are presented in plots.
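The reviews describe the evaluation as closed-QA with exact-match answers over five outcome classes, reported as accuracy and per-class recall. A minimal sketch of that kind of scoring is shown here; it is an illustration under stated assumptions, not the authors' code. The label strings in `LABELS`, the `normalize` helper, and the function names are all hypothetical, since this page says only that there are five treatment outcome effects.

```python
from collections import Counter

# Hypothetical label set: the page states only that there are five outcome classes.
LABELS = ["higher", "lower", "no difference", "uncertain effect", "insufficient data"]


def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences still match."""
    return " ".join(answer.strip().lower().split())


def exact_match_accuracy(gold: list[str], pred: list[str]) -> float:
    """Fraction of questions where the predicted label equals the gold label."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    if not gold:
        return 0.0
    hits = sum(normalize(g) == normalize(p) for g, p in zip(gold, pred))
    return hits / len(gold)


def per_class_recall(gold: list[str], pred: list[str], labels=LABELS) -> dict[str, float]:
    """Recall per gold class; classes absent from the gold labels are skipped."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    totals = Counter(normalize(g) for g in gold)
    hits = Counter(
        normalize(g) for g, p in zip(gold, pred) if normalize(g) == normalize(p)
    )
    return {label: hits[label] / totals[label] for label in labels if totals[label] > 0}


if __name__ == "__main__":
    # Toy data for illustration only; not taken from the benchmark.
    gold = ["higher", "lower", "no difference", "higher"]
    pred = ["Higher", "no difference", "no difference", "lower"]
    print(exact_match_accuracy(gold, pred))
    print(per_class_recall(gold, pred))
```

Reporting recall per class alongoverall accuracy matters for a task like this because a model that over-predicts one common class can score reasonable accuracy while rarely recovering the rarer conclusions.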
Taxonomy
Topics: Meta-analysis and systematic reviews · Artificial Intelligence in Healthcare and Education · Topic Modeling
