MedQA-CS: Objective Structured Clinical Examination (OSCE)-Style Benchmark for Evaluating LLM Clinical Skills

Zonghai Yao; Zihao Zhang; Chaolong Tang; Xingyu Bian; Youxia Zhao; Zhichao Yang; Junda Wang; Huixue Zhou; Won Seok Jang; Feiyun Ouyang; Hong Yu

arXiv:2410.01553 · cs.AI · January 21, 2026 · 2 cites

TL;DR

MedQA-CS introduces an OSCE-inspired benchmark to evaluate large language models' clinical skills in healthcare, providing a comprehensive and challenging assessment framework with expert annotations.

Contribution

The paper develops MedQA-CS, a new evaluation framework with publicly available data and expert annotations, to better assess LLMs' clinical skills in healthcare.

Findings

01 MedQA-CS is more challenging than traditional multiple-choice QA benchmarks (e.g., MedQA).

02 LLMs show varied performance across clinical scenarios.

03 The framework enables comprehensive evaluation of LLMs' clinical capabilities.

Abstract

Artificial intelligence (AI) and large language models (LLMs) in healthcare require advanced clinical skills (CS), yet current benchmarks fail to evaluate these comprehensively. We introduce MedQA-CS, an AI-SCE framework inspired by medical education's Objective Structured Clinical Examinations (OSCEs), to address this gap. MedQA-CS evaluates LLMs through two instruction-following tasks, LLM-as-medical-student and LLM-as-CS-examiner, designed to reflect real clinical scenarios. Our contributions include developing MedQA-CS, a comprehensive evaluation framework with publicly available data and expert annotations, and providing the quantitative and qualitative assessment of LLMs as reliable judges in CS evaluation. Our experiments show that MedQA-CS is a more challenging benchmark for evaluating clinical skills than traditional multiple-choice QA benchmarks (e.g., MedQA). Combined with…

Code & Models

Repositories

bio-nlp/medqa-cs (Official)

Videos

MedQA-CS: Objective Structured Clinical Examination (OSCE)-Style Benchmark for Evaluating LLM Clinical Skills (Underline)

Taxonomy

Topics: Artificial Intelligence in Healthcare and Education