MedMeta: A Benchmark for LLMs in Synthesizing Meta-Analysis Conclusion from Medical Studies

Huy Hoang Ha; Benoit Favre; Francois Portet

arXiv:2605.09661 · cs.CL · May 12, 2026

TL;DR

MedMeta is a new benchmark for evaluating how well large language models synthesize medical meta-analysis conclusions from the abstracts of the cited studies. The results highlight the importance of grounding models in source evidence and reveal current model limitations.

Contribution

Introduces MedMeta, the first benchmark for LLMs in medical evidence synthesis, with a comprehensive evaluation framework and analysis of model performance and vulnerabilities.

Findings

1. Golden-RAG outperforms parametric-only approaches.

2. Domain-specific fine-tuning has marginal benefits.

3. Models struggle to reject negated evidence.

Abstract

Large language models (LLMs) have saturated standard medical benchmarks that test factual recall, yet their ability to perform higher-order reasoning, such as synthesizing evidence from multiple sources, remains critically under-explored. To address this gap, we introduce MedMeta, the first benchmark designed to evaluate an LLM's ability to generate conclusions from medical meta-analyses using only the abstracts of cited studies. MedMeta comprises 81 meta-analyses from PubMed (2018–2025) and evaluates models using two distinct workflows: a Retrieval-Augmented Generation (Golden-RAG) setting with ground-truth abstracts, and a Parametric-only approach relying on internal knowledge. Our evaluation framework is validated by a well-structured analysis showing our LLM-as-a-judge protocol strongly aligns with human expert ratings, as evidenced by high Pearson's r correlation (0.81) and…
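
The abstract reports agreement between the LLM judge and human experts as a Pearson's r correlation. As a minimal sketch of how such an agreement score can be computed, the Python below implements the standard Pearson formula. The function name `pearson_r` and the two rating lists are hypothetical illustrations, not data or code from the paper, and the paper's actual rating scale and validation procedure may differ.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations and sums of squared deviations from each mean.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical ratings for six generated conclusions (assumed 1-5 scale).
judge_scores = [4, 2, 5, 3, 1, 4]
human_scores = [5, 2, 4, 3, 1, 4]

r = pearson_r(judge_scores, human_scores)
print(f"Pearson's r between judge and human ratings: {r:.2f}")
```

A value near 1 indicates that the judge ranks and spaces conclusions much as the human raters do. Pearson's r measures linear association only, so it does not show that the two sets of scores agree in absolute value.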
