Beyond Accuracy: Investigating Error Types in GPT-4 Responses to USMLE Questions

Soumyadeep Roy; Aparup Khatua; Fatemeh Ghoochani; Uwe Hadler; Wolfgang Nejdl; Niloy Ganguly

arXiv:2404.13307·cs.CL·April 23, 2024

TL;DR

This paper investigates the types of errors GPT-4 makes on USMLE questions, introduces a detailed error taxonomy and dataset, and analyzes the reasoning behind incorrect responses to improve understanding of LLMs in medical QA.

Contribution

It presents a new error taxonomy and a large annotated dataset for analyzing GPT-4's medical question responses, highlighting the complexity of reasoning errors.

Findings

01

Annotators labeled a significant portion of GPT-4's incorrect responses as reasonable, which suggests that flawed reasoning is hard to spot in a plausible explanation.

02

The dataset includes detailed explanations and medical concepts for in-depth analysis.

03

Resources are publicly available for further research.

Abstract

GPT-4 demonstrates high accuracy in medical QA tasks, leading with an accuracy of 86.70%, followed by Med-PaLM 2 at 86.50%. However, an error rate of around 14% remains. Additionally, existing work uses GPT-4 only to predict the correct option, without any explanation, and thus gives no insight into the reasoning used by GPT-4 or other LLMs. Therefore, we introduce a new domain-specific error taxonomy developed in collaboration with medical students. Our GPT-4 USMLE Error (G4UE) dataset comprises 4153 correct and 919 incorrect GPT-4 responses to United States Medical Licensing Examination (USMLE) questions. These responses are quite long (258 words on average) and contain detailed explanations from GPT-4 justifying the selected option. We then launch a large-scale annotation study using the Potato annotation platform and recruit 44 medical…
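
The dataset counts quoted in the abstract can be turned into a quick sanity check. The sketch below only does arithmetic on the two reported numbers (4153 correct, 919 incorrect). The function name `incorrect_share` is a hypothetical helper, not something from the paper or its repository, and the abstract does not say how this share relates to the roughly 14% error figure, so no such link is assumed here.

```python
# Counts reported in the abstract for the G4UE dataset.
CORRECT_RESPONSES = 4153
INCORRECT_RESPONSES = 919


def incorrect_share(correct: int, incorrect: int) -> float:
    """Return the fraction of responses that are incorrect.

    Hypothetical helper for illustration; not part of the paper's code.
    """
    total = correct + incorrect
    if total == 0:
        raise ValueError("at least one response is required")
    return incorrect / total


total_responses = CORRECT_RESPONSES + INCORRECT_RESPONSES
share = incorrect_share(CORRECT_RESPONSES, INCORRECT_RESPONSES)

print(f"Total responses in G4UE: {total_responses}")
print(f"Share of incorrect responses: {share:.1%}")
```

By these counts, incorrect responses make up a bit under one fifth of the responses in G4UE.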

Code & Models

Repositories

roysoumya/usmle-gpt4-error-taxonomy
Official

Taxonomy

Methods: Attention Is All You Need · Position-Wise Feed-Forward Layer · Byte Pair Encoding · Absolute Position Encodings · Dropout · Dense Connections · Label Smoothing · Residual Connection · Softmax · Adam