MedConclusion: A Benchmark for Biomedical Conclusion Generation from Structured Abstracts
Weiyue Li, Ruizhi Qian, Yi Li, Yongce Li, Yunfan Long, Jiahui Cai, Yan Luo, Mengyu Wang

TL;DR
MedConclusion introduces a large dataset of structured biomedical abstracts paired with conclusions, enabling evaluation of language models' reasoning abilities in biomedical inference tasks.
Contribution
The paper presents MedConclusion, a new large-scale dataset for biomedical conclusion generation, and evaluates how well LLMs perform evidence-to-conclusion reasoning.
Findings
Models behave differently when writing conclusions than when writing summaries.
Current models' scores are closely clustered under automatic metrics.
The identity of the LLM judge significantly influences scoring outcomes.
Abstract
Large language models (LLMs) are widely explored for reasoning-intensive research tasks, yet resources for testing whether they can infer scientific conclusions from structured biomedical evidence remain limited. We introduce MedConclusion, a large-scale dataset of PubMed structured abstracts for biomedical conclusion generation. Each instance pairs the non-conclusion sections of an abstract with the original author-written conclusion, providing naturally occurring supervision for evidence-to-conclusion reasoning. MedConclusion also includes journal-level metadata such as biomedical category and SJR, enabling subgroup analysis across biomedical domains. As an initial study, we evaluate diverse LLMs under conclusion and summary prompting settings and score outputs with both reference-based metrics and LLM-as-a-judge. We find that conclusion writing is…
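The instance construction described in the abstract (non-conclusion sections as input, the author-written conclusion as the reference) might look roughly like the sketch below. It is illustrative only and is not the authors' pipeline. The `Instance` class, the `build_instance` function, the field names, and the set of section labels treated as conclusions are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Assumption: section labels treated as the conclusion. The labels the
# dataset actually uses are not stated in the text above.
CONCLUSION_LABELS = {"CONCLUSION", "CONCLUSIONS"}


@dataclass
class Instance:
    """Hypothetical container for one evidence-to-conclusion pair."""
    evidence: str               # all non-conclusion sections, in order
    reference_conclusion: str   # the author-written conclusion


def build_instance(sections: List[Tuple[str, str]]) -> Optional[Instance]:
    """Split a structured abstract into evidence and reference conclusion.

    `sections` is an ordered list of (label, text) pairs. Returns None when
    the abstract has no conclusion section or no evidence sections.
    """
    evidence_parts = []
    conclusion_parts = []
    for label, text in sections:
        norm = label.strip().upper()
        if norm in CONCLUSION_LABELS:
            conclusion_parts.append(text.strip())
        else:
            evidence_parts.append(f"{norm}: {text.strip()}")
    if not evidence_parts or not conclusion_parts:
        return None
    return Instance(
        evidence="\n".join(evidence_parts),
        reference_conclusion=" ".join(conclusion_parts),
    )


# Usage with placeholder text (not real abstract content).
example = build_instance([
    ("Background", "placeholder background"),
    ("Methods", "placeholder methods"),
    ("Results", "placeholder results"),
    ("Conclusions", "placeholder conclusion"),
])
```

In this sketch a model would receive `example.evidence` as its input, and its output would be compared against `example.reference_conclusion`.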
