Are Large Language Models True Healthcare Jacks-of-All-Trades? Benchmarking Across Health Professions Beyond Physician Exams

Zheheng Luo; Chenhan Yuan; Qianqian Xie; Sophia Ananiadou

arXiv:2406.11328·cs.CL·June 18, 2024·1 cites

TL;DR

This paper introduces EMPEC, a large-scale traditional Chinese healthcare knowledge benchmark spanning 20 healthcare professions, and evaluates 17 LLMs on it, revealing strengths and gaps in their medical understanding and multilingual capabilities.

Contribution

The paper presents EMPEC, the first large-scale, multi-profession Chinese healthcare benchmark, and provides an extensive evaluation of 17 LLMs on healthcare knowledge tasks.

Findings

01. GPT-4 achieves over 75% accuracy but struggles with specialized fields.

02. General-purpose LLMs outperform medical-specific models.

03. Training data from EMPEC improves model performance.

Abstract

Recent advancements in Large Language Models (LLMs) have demonstrated their potential in delivering accurate answers to questions about world knowledge. Despite this, existing benchmarks for evaluating LLMs in healthcare predominantly focus on medical doctors, leaving other critical healthcare professions underrepresented. To fill this research gap, we introduce the Examinations for Medical Personnel in Chinese (EMPEC), a pioneering large-scale healthcare knowledge benchmark in traditional Chinese. EMPEC consists of 157,803 exam questions across 124 subjects and 20 healthcare professions, including underrepresented occupations like Optometrists and Audiologists. Each question is tagged with its release time and source, ensuring relevance and authenticity. We conducted extensive experiments on 17 LLMs, including proprietary, open-source models, general domain models and medical specific…
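The benchmark structure described in the abstract (questions grouped by profession and subject, each tagged with a release time and source) lends itself to per-profession scoring. The following is a minimal sketch of that idea, not the authors' evaluation code. Every field name, the `options` mapping (which assumes a multiple-choice format), and the function `accuracy_by_profession` are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class ExamQuestion:
    # Hypothetical schema: the abstract only states that each question
    # belongs to a profession and subject and carries a release time
    # and a source. Field names and types here are assumptions.
    profession: str
    subject: str
    release_time: str
    source: str
    question: str
    options: dict  # assumed multiple-choice: option label -> option text
    answer: str    # assumed: label of the correct option


def accuracy_by_profession(questions, predictions):
    """Return {profession: fraction of its questions answered correctly}."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for q, pred in zip(questions, predictions):
        total[q.profession] += 1
        if pred == q.answer:
            correct[q.profession] += 1
    return {prof: correct[prof] / total[prof] for prof in total}


# Toy placeholder records, not real EMPEC data.
toy_questions = [
    ExamQuestion("Optometrist", "subject-a", "2020", "exam-x", "q1", {"A": "x", "B": "y"}, "A"),
    ExamQuestion("Optometrist", "subject-a", "2021", "exam-x", "q2", {"A": "x", "B": "y"}, "B"),
    ExamQuestion("Audiologist", "subject-b", "2021", "exam-y", "q3", {"A": "x", "B": "y"}, "A"),
]
toy_predictions = ["A", "A", "A"]  # stand-in for model outputs
scores = accuracy_by_profession(toy_questions, toy_predictions)
```

Reporting accuracy per profession rather than as a single aggregate makes it easier to see where a model is weak in specialized fields, which a pooled score can hide.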

Code & Models

Repositories

zhehengluoK/eval_empec
PyTorch · Official

Taxonomy

Topics: Artificial Intelligence in Healthcare and Education · Economic and Financial Impacts of Cancer

Methods: Sparse Evolutionary Training · Residual Connection · Softmax · Layer Normalization · Focus · Byte Pair Encoding · Label Smoothing · Adam · Attention Is All You Need · Linear Layer