MultifacetEval: Multifaceted Evaluation to Probe LLMs in Mastering Medical Knowledge
Yuxuan Zhou, Xien Liu, Chen Ning, Ji Wu

TL;DR
This paper introduces MultifacetEval, a comprehensive evaluation framework that reveals current large language models lack sufficient depth, precision, and coverage in mastering medical knowledge, highlighting their limited readiness for real-world medical applications.
Contribution
The paper develops a novel multifaceted evaluation framework and datasets to systematically assess LLMs' mastery of medical knowledge across multiple dimensions.
Findings
LLMs perform significantly worse on multifaceted medical questions than on standard benchmarks.
Current LLMs lack depth, precision, and coverage in medical knowledge.
LLMs are not yet suitable for real-world medical tasks.
Abstract
Large language models (LLMs) have excelled across domains and delivered notable performance on medical evaluation benchmarks such as MedQA. However, a significant gap remains between reported performance and practical effectiveness in real-world medical scenarios. In this paper, we aim to explore the causes of this gap by employing a multifaceted examination schema to systematically probe how well current LLMs actually master medical knowledge. Specifically, we develop a novel evaluation framework, MultifacetEval, to examine the degree and coverage of LLMs in encoding and mastering medical knowledge at multiple facets (comparison, rectification, discrimination, and verification) concurrently. Based on the MultifacetEval framework, we construct two multifaceted evaluation datasets: MultiDiseK (by producing questions from a clinical disease knowledge base) and…
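To make the idea of examining several facets "concurrently" concrete, here is a minimal Python sketch of one way such scoring could be aggregated. It is not the authors' implementation. The function name `multifaceted_accuracy`, the `(knowledge_point_id, facet, is_correct)` input layout, and the rule that a knowledge point counts as mastered only when all four facets are answered correctly are assumptions made for illustration. Only the four facet names come from the abstract.

```python
from collections import defaultdict

# Facet names taken from the abstract; everything else below is hypothetical.
FACETS = ("comparison", "rectification", "discrimination", "verification")


def multifaceted_accuracy(results):
    """Aggregate graded answers per facet and per knowledge point.

    results: iterable of (knowledge_point_id, facet, is_correct) tuples.
    Returns (facet_acc, joint_acc), where facet_acc maps each observed facet
    to its accuracy and joint_acc is the fraction of knowledge points that
    were answered correctly on every facet (an assumed mastery criterion).
    """
    per_facet = defaultdict(list)
    per_point = defaultdict(dict)
    for point_id, facet, is_correct in results:
        if facet not in FACETS:
            raise ValueError(f"unknown facet: {facet}")
        per_facet[facet].append(bool(is_correct))
        per_point[point_id][facet] = bool(is_correct)

    facet_acc = {f: sum(v) / len(v) for f, v in per_facet.items()}
    # A missing facet is treated as not mastered.
    mastered = [
        all(answers.get(f, False) for f in FACETS)
        for answers in per_point.values()
    ]
    joint_acc = sum(mastered) / len(mastered) if mastered else 0.0
    return facet_acc, joint_acc


if __name__ == "__main__":
    # Toy data: "d1" is correct on all facets, "d2" fails verification.
    sample = [("d1", f, True) for f in FACETS]
    sample += [("d2", f, f != "verification") for f in FACETS]
    print(multifaceted_accuracy(sample))
```

Under this assumed criterion, per-facet accuracy can stay high while joint accuracy drops, which is one way a gap between benchmark scores and actual mastery could show up.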
Taxonomy
Topics: Wikis in Education and Collaboration · Biomedical Text Mining and Ontologies · Genetics, Bioinformatics, and Biomedical Research
