MedMeta: A Benchmark for LLMs in Synthesizing Meta-Analysis Conclusion from Medical Studies
Huy Hoang Ha, Benoit Favre, Francois Portet

TL;DR
MedMeta is a new benchmark for evaluating large language models' ability to synthesize medical meta-analysis conclusions from abstracts, highlighting the importance of information grounding and revealing current limitations.
Contribution
Introduces MedMeta, the first benchmark for LLMs in medical evidence synthesis, with a comprehensive evaluation framework and analysis of model performance and vulnerabilities.
Findings
Golden-RAG outperforms parametric-only approaches.
Domain-specific fine-tuning has marginal benefits.
Models struggle to reject negated evidence.
Abstract
Large language models (LLMs) have saturated standard medical benchmarks that test factual recall, yet their ability to perform higher-order reasoning, such as synthesizing evidence from multiple sources, remains critically under-explored. To address this gap, we introduce MedMeta, the first benchmark designed to evaluate an LLM's ability to generate conclusions from medical meta-analyses using only the abstracts of cited studies. MedMeta comprises 81 meta-analyses from PubMed (2018–2025) and evaluates models using two distinct workflows: a Retrieval-Augmented Generation (Golden-RAG) setting with ground-truth abstracts, and a Parametric-only approach relying on internal knowledge. Our evaluation framework is validated by a well-structured analysis showing our LLM-as-a-judge protocol strongly aligns with human expert ratings, as evidenced by high Pearson's r correlation (0.81) and…
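The abstract reports agreement between the LLM-as-a-judge scores and human expert ratings as a Pearson's r. As a minimal sketch of how such an alignment check could be computed, the Python below implements the standard Pearson correlation formula. The function name `pearson_r` and the two score lists are hypothetical and purely illustrative; they are not the paper's data, rating scale, or code.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical ratings for five generated conclusions (illustrative only).
human_scores = [1, 2, 3, 4, 5]
judge_scores = [1, 3, 2, 5, 4]

r = pearson_r(human_scores, judge_scores)
print(f"Pearson's r = {r:.2f}")  # prints "Pearson's r = 0.80"
```

A value near 1 means the judge ranks and spaces the conclusions much as the human raters do. A production analysis would more likely call an established routine such as `scipy.stats.pearsonr`, which also returns a p-value.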
