MedEval: A Multi-Level, Multi-Task, and Multi-Domain Medical Benchmark for Language Model Evaluation

Zexue He; Yu Wang; An Yan; Yao Liu; Eric Y. Chang; Amilcare Gentili; Julian McAuley; Chun-Nan Hsu

arXiv:2310.14088 · cs.CL · November 16, 2023 · 2 cites

Open Access

TL;DR

MedEval is a comprehensive, multi-level, multi-task, multi-domain medical benchmark designed to evaluate and improve language models in healthcare, highlighting the importance of instruction tuning for effective medical language understanding.

Contribution

This paper introduces MedEval, a novel medical benchmark with extensive annotated datasets across multiple domains and tasks, enabling systematic evaluation of language models in healthcare.

Findings

01

Large language models show variable effectiveness across tasks.

02

Instruction tuning enhances few-shot performance of large models.

03

Benchmarking reveals strengths and limitations of models in medical contexts.

Abstract

Curated datasets for healthcare are often limited due to the need for human annotations from experts. In this paper, we present MedEval, a multi-level, multi-task, and multi-domain medical benchmark to facilitate the development of language models for healthcare. MedEval is comprehensive and consists of data from several healthcare systems and spans 35 human body regions from 8 examination modalities. With 22,779 collected sentences and 21,228 reports, we provide expert annotations at multiple levels, offering a granular potential usage of the data and supporting a wide range of tasks. Moreover, we systematically evaluated 10 generic and domain-specific language models under zero-shot and finetuning settings, from domain-adapted baselines in healthcare to general-purpose state-of-the-art large language models (e.g., ChatGPT). Our evaluations reveal varying effectiveness of the two…

Taxonomy

Topics: Artificial Intelligence in Healthcare and Education · Topic Modeling · Machine Learning in Healthcare