CareMedEval dataset: Evaluating Critical Appraisal and Reasoning in the Biomedical Field
Doria Bonzi, Alexandre Guiggi, Frédéric Béchet, Carlos Ramisch, Benoit Favre

TL;DR
CareMedEval is a new dataset for evaluating language models on critical appraisal and reasoning in biomedical literature, highlighting current limitations of models in understanding scientific papers.
Contribution
The paper introduces CareMedEval, a novel dataset derived from medical student exams to assess LLMs' biomedical critical reasoning capabilities.
Findings
State-of-the-art open and commercial models fail to exceed an Exact Match Rate of 0.5.
Intermediate reasoning improves model performance.
Models find questions on study limitations and statistics particularly challenging.
Abstract
Critical appraisal of scientific literature is an essential skill in the biomedical field. While large language models (LLMs) can offer promising support in this task, their reliability remains limited, particularly for critical reasoning in specialized domains. We introduce CareMedEval, an original dataset designed to evaluate LLMs on biomedical critical appraisal and reasoning tasks. Derived from authentic exams taken by French medical students, the dataset contains 534 questions based on 37 scientific articles. Unlike existing benchmarks, CareMedEval explicitly evaluates critical reading and reasoning grounded in scientific papers. Benchmarking state-of-the-art generalist and biomedical-specialized LLMs under various context conditions reveals the difficulty of the task: open and commercial models fail to exceed an Exact Match Rate of 0.5 even though generating intermediate reasoning…
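The abstract reports results as an Exact Match Rate but the text here does not define it. As a minimal sketch, assuming each question has a set of correct option labels and a prediction counts only when it matches that set exactly, the metric could be computed as below. The function name, the set-based answer format, and the toy data are illustrative assumptions, not the authors' evaluation code.

```python
def exact_match_rate(predictions, references):
    """Fraction of questions whose predicted answer set equals the gold set.

    Assumption: each answer is a collection of option labels, and partial
    overlap earns no credit.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = sum(set(p) == set(r) for p, r in zip(predictions, references))
    return hits / len(references)


# Hypothetical toy data: the option labels are made up, not taken from the dataset.
gold = [{"A", "C"}, {"B"}, {"A", "D", "E"}, {"C"}]
pred = [{"A", "C"}, {"B", "D"}, {"A", "D"}, {"C"}]

score = exact_match_rate(pred, gold)
print(f"Exact Match Rate: {score:.2f}")
```

Under this assumed definition, a prediction that adds a wrong option or misses a correct one scores zero for that question, which is one reason an exact-match score can sit well below a per-option accuracy.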
Taxonomy
Topics: Biomedical Text Mining and Ontologies · Academic integrity and plagiarism · scientometrics and bibliometrics research
