MedConceptsQA: Open Source Medical Concepts QA Benchmark
Ofir Ben Shoham, Nadav Rappoport

TL;DR
MedConceptsQA is an open-source benchmark for medical concept question answering. Pre-trained clinical language models score close to random guessing on it, while GPT-4 scores roughly 27 to 37 percentage points higher on average.
Contribution
Introduces MedConceptsQA, a new open-source benchmark for medical concept QA, and provides evaluation results highlighting GPT-4's superior performance.
Findings
Clinical LLMs perform near random on the benchmark.
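GPT-4 improves average accuracy over clinical LLMs by nearly 27 percentage points (absolute) in the zero-shot setting and nearly 37 in the few-shot setting.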
The benchmark covers diagnoses, procedures, and drugs at three difficulty levels: easy, medium, and hard.
Abstract
We present MedConceptsQA, a dedicated open source benchmark for medical concepts question answering. The benchmark comprises questions about various medical concepts across different vocabularies: diagnoses, procedures, and drugs. The questions are categorized into three levels of difficulty: easy, medium, and hard. We conducted evaluations of the benchmark using various Large Language Models. Our findings show that pre-trained clinical Large Language Models achieved accuracy levels close to random guessing on this benchmark, despite being pre-trained on medical data. However, GPT-4 achieves an absolute average improvement of nearly 27%-37% (27% for zero-shot learning and 37% for few-shot learning) when compared to clinical Large Language Models. Our benchmark serves as a valuable resource for evaluating the understanding and reasoning of medical concepts by Large Language Models. Our…
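The comparison in the abstract rests on three quantities: a model's accuracy, the accuracy expected from random guessing, and the absolute gap between two accuracies. The sketch below shows one way these could be computed for a multiple-choice benchmark. It is not the authors' evaluation code: the `Question` record, the number of answer options, and the sample items are all hypothetical placeholders chosen for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Question:
    """Hypothetical record layout; the real benchmark schema may differ."""
    prompt: str
    options: dict[str, str]   # option label -> candidate description
    answer: str               # label of the correct option
    level: str                # "easy", "medium", or "hard"


def accuracy(questions: list[Question], predictions: list[str]) -> float:
    """Fraction of questions whose predicted label matches the answer."""
    if len(questions) != len(predictions):
        raise ValueError("need exactly one prediction per question")
    correct = sum(p == q.answer for q, p in zip(questions, predictions))
    return correct / len(questions)


def random_baseline(questions: list[Question]) -> float:
    """Expected accuracy of uniform random guessing over each option set."""
    return sum(1 / len(q.options) for q in questions) / len(questions)


def absolute_gain_points(acc_a: float, acc_b: float) -> float:
    """Absolute difference between two accuracies, in percentage points."""
    return (acc_a - acc_b) * 100


# Placeholder items only: four options per question is an assumption.
sample = [
    Question(
        prompt=f"What is the description of placeholder concept {i}?",
        options={"A": "desc 1", "B": "desc 2", "C": "desc 3", "D": "desc 4"},
        answer="A",
        level="easy",
    )
    for i in range(4)
]
sample_predictions = ["A", "B", "A", "D"]  # made-up model outputs

sample_accuracy = accuracy(sample, sample_predictions)
sample_baseline = random_baseline(sample)
sample_gain = absolute_gain_points(sample_accuracy, sample_baseline)
```

Reporting the gap in percentage points keeps it distinct from a relative improvement: moving from 25% to 50% accuracy is a gain of 25 points, even though the accuracy has doubled.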
Taxonomy
Topics: Health and Medical Research Impacts
Methods: Attention Is All You Need · Linear Layer · Multi-Head Attention · Dense Connections · Position-Wise Feed-Forward Layer · Dropout · Label Smoothing · Residual Connection · Absolute Position Encodings · Byte Pair Encoding
