Are Large Language Models True Healthcare Jacks-of-All-Trades? Benchmarking Across Health Professions Beyond Physician Exams
Zheheng Luo, Chenhan Yuan, Qianqian Xie, Sophia Ananiadou

TL;DR
This paper introduces EMPEC, a large-scale healthcare knowledge benchmark in Traditional Chinese covering 20 healthcare professions, and evaluates 17 LLMs on it, revealing strengths and gaps in their medical understanding and multilingual capabilities.
Contribution
It presents EMPEC, the first large-scale, multi-profession Chinese healthcare benchmark, and provides extensive evaluation of various LLMs' performance in healthcare knowledge tasks.
Findings
GPT-4 achieves over 75% accuracy but struggles with specialized fields.
General-purpose LLMs outperform medical-specific models.
Training data from EMPEC improves model performance.
Abstract
Recent advancements in Large Language Models (LLMs) have demonstrated their potential in delivering accurate answers to questions about world knowledge. Despite this, existing benchmarks for evaluating LLMs in healthcare predominantly focus on medical doctors, leaving other critical healthcare professions underrepresented. To fill this research gap, we introduce the Examinations for Medical Personnel in Chinese (EMPEC), a pioneering large-scale healthcare knowledge benchmark in traditional Chinese. EMPEC consists of 157,803 exam questions across 124 subjects and 20 healthcare professions, including underrepresented occupations like Optometrists and Audiologists. Each question is tagged with its release time and source, ensuring relevance and authenticity. We conducted extensive experiments on 17 LLMs, including proprietary, open-source models, general domain models and medical specific…
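The per-profession comparison described above can be illustrated with a minimal sketch of how accuracy might be computed on multiple-choice exam questions grouped by profession. The record fields (`profession`, `answer`, `prediction`) and the toy records are assumptions made for illustration; they are not EMPEC's actual schema, data, or evaluation code.

```python
from collections import defaultdict


def accuracy_by_profession(records):
    """Return {profession: accuracy} for multiple-choice predictions.

    Each record is a dict with hypothetical keys:
      'profession' - the healthcare profession the question belongs to
      'answer'     - the gold option letter, e.g. "A"
      'prediction' - the option letter produced by the model
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        profession = record["profession"]
        total[profession] += 1
        # Compare option letters case-insensitively, ignoring stray whitespace.
        gold = record["answer"].strip().upper()
        predicted = record["prediction"].strip().upper()
        if predicted == gold:
            correct[profession] += 1
    return {p: correct[p] / total[p] for p in total}


# Made-up toy records, not drawn from the benchmark.
sample = [
    {"profession": "Optometrist", "answer": "A", "prediction": "A"},
    {"profession": "Optometrist", "answer": "C", "prediction": "B"},
    {"profession": "Audiologist", "answer": "D", "prediction": "d "},
]
scores = accuracy_by_profession(sample)
```

Reporting accuracy per profession rather than as a single pooled number is what would make gaps in less-represented occupations visible.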
Taxonomy
Topics: Artificial Intelligence in Healthcare and Education · Economic and Financial Impacts of Cancer
Methods: Sparse Evolutionary Training · Residual Connection · Softmax · Layer Normalization · Focus · Byte Pair Encoding · Label Smoothing · Adam · Attention Is All You Need · Linear Layer
