MedConceptsQA: Open Source Medical Concepts QA Benchmark

Ofir Ben Shoham; Nadav Rappoport

arXiv:2405.07348 · cs.CL · May 15, 2024

TL;DR

MedConceptsQA is an open-source benchmark for medical concept question answering. Pre-trained clinical language models score close to random guessing on it, while GPT-4 is substantially more accurate.

Contribution

Introduces MedConceptsQA, an open-source benchmark for medical concept QA, and reports evaluation results in which GPT-4 outperforms pre-trained clinical LLMs.

Findings

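01 Clinical LLMs perform near random on the benchmark.

02 GPT-4 improves average accuracy over clinical LLMs by nearly 27 percentage points (zero-shot) to 37 percentage points (few-shot) in absolute terms.

03 The benchmark covers diagnoses, procedures, and drugs across three difficulty levels: easy, medium, and hard.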

Abstract

We present MedConceptsQA, a dedicated open source benchmark for medical concepts question answering. The benchmark comprises questions on various medical concepts across different vocabularies: diagnoses, procedures, and drugs. The questions are categorized into three levels of difficulty: easy, medium, and hard. We conducted evaluations of the benchmark using various Large Language Models. Our findings show that pre-trained clinical Large Language Models achieved accuracy levels close to random guessing on this benchmark, despite being pre-trained on medical data. However, GPT-4 achieves an absolute average improvement of nearly 27% to 37% (27% for zero-shot learning and 37% for few-shot learning) when compared to clinical Large Language Models. Our benchmark serves as a valuable resource for evaluating the understanding and reasoning of medical concepts by Large Language Models. Our…
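
To make the comparison in the abstract concrete, the following is a minimal sketch of how accuracy, a random-guessing baseline, and an absolute improvement in percentage points relate to one another. It is not the authors' evaluation code. The function names, the answer letters, the assumption of four answer options per question, and the toy predictions are all hypothetical and chosen only for illustration.

```python
from typing import Sequence


def accuracy(predictions: Sequence[str], answers: Sequence[str]) -> float:
    """Fraction of questions where the predicted option equals the answer key."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("need at least one question")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)


def random_baseline(num_options: int) -> float:
    """Expected accuracy of uniform random guessing among num_options choices."""
    if num_options < 1:
        raise ValueError("num_options must be positive")
    return 1.0 / num_options


def absolute_improvement(model_acc: float, reference_acc: float) -> float:
    """Accuracy difference expressed in percentage points (not a relative gain)."""
    return (model_acc - reference_acc) * 100.0


# Toy data, invented for illustration only (not results from the paper).
answers = ["A", "C", "B", "D", "A", "B", "C", "D"]
weak_model_preds = ["A", "B", "B", "A", "C", "D", "A", "D"]
strong_model_preds = ["A", "C", "B", "D", "A", "B", "A", "D"]

weak_acc = accuracy(weak_model_preds, answers)
strong_acc = accuracy(strong_model_preds, answers)
chance = random_baseline(num_options=4)  # assumption: four options per question
gain_points = absolute_improvement(strong_acc, weak_acc)

print(f"weak model accuracy:   {weak_acc:.3f}")
print(f"strong model accuracy: {strong_acc:.3f}")
print(f"random baseline:       {chance:.3f}")
print(f"absolute improvement:  {gain_points:.1f} percentage points")
```

The distinction that matters here is between an absolute and a relative improvement: going from 0.375 to 0.875 accuracy in this toy example is a gain of 50 percentage points, which is the sense in which the abstract reports its 27% and 37% figures.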

Code & Models

Repositories

nadavlab/MedConceptsQA
PyTorch · Official

Datasets

ChuGyouk/KorMedConceptsQA
dataset · 66 downloads

Taxonomy

Topics: Health and Medical Research Impacts

Methods: Attention Is All You Need · Linear Layer · Multi-Head Attention · Dense Connections · Position-Wise Feed-Forward Layer · Dropout · Label Smoothing · Residual Connection · Absolute Position Encodings · Byte Pair Encoding