Do Large Language Models have Shared Weaknesses in Medical Question Answering?

Andrew M. Bean; Karolina Korgul; Felix Krones; Robert McCraith; Adam; Mahdi

arXiv:2310.07225 · cs.CL · October 14, 2024 · 1 citation

Open Access

TL;DR

This study benchmarks 16 top large language models on Polish medical licensing exam questions to identify shared weaknesses and patterns, revealing consistent model behaviors and correlations with human performance.

Contribution

It provides a comprehensive analysis of shared strengths and weaknesses across multiple LLMs in medical question answering, highlighting persistent patterns likely to continue in future models.

Findings

01

Models show positive correlation in correct answers across questions.

02

Model performance correlates with human performance and confidence.

03

Longer questions tend to decrease model accuracy.

Abstract

Large language models (LLMs) have made rapid improvement on medical benchmarks, but their unreliability remains a persistent challenge for safe real-world use. Designing for the use of LLMs as a category, rather than for specific models, requires developing an understanding of the shared strengths and weaknesses that appear across models. To address this challenge, we benchmark a range of top LLMs and identify consistent patterns across models. We test $16$ well-known LLMs on $874$ newly collected questions from Polish medical licensing exams. For each question, we score each model on top-1 accuracy and the distribution of probabilities assigned. We then compare these results with factors such as question difficulty for humans, question length, and the scores of the other models. LLM accuracies were positively correlated pairwise ($0.39$ to $0.58$). Model performance was also correlated…
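
The scoring and pairwise comparison described in the abstract can be illustrated with a minimal sketch. It assumes that each model returns a probability for every answer option, that top-1 correctness is the argmax compared against the answer key, and that the pairwise figure is a Pearson correlation over per-question correctness; the abstract does not state which coefficient the authors used. The model names, probabilities, and answer key below are invented toy data, not results from the paper.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    if n == 0 or n != len(y):
        raise ValueError("sequences must be non-empty and of equal length")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


def top1_correct(probs, answer):
    """Return 1 if the highest-probability option matches the answer key, else 0."""
    predicted = max(probs, key=probs.get)
    return int(predicted == answer)


def pairwise_accuracy_correlations(correct_by_model):
    """Correlate per-question correctness for every pair of models."""
    return {
        (m1, m2): pearson(correct_by_model[m1], correct_by_model[m2])
        for m1, m2 in combinations(sorted(correct_by_model), 2)
    }


# Toy data, invented for illustration only (hypothetical models and questions).
answer_key = ["A", "C", "B", "D"]
model_probs = {
    "model_a": [
        {"A": 0.7, "B": 0.1, "C": 0.1, "D": 0.1},
        {"A": 0.2, "B": 0.5, "C": 0.2, "D": 0.1},
        {"A": 0.1, "B": 0.6, "C": 0.2, "D": 0.1},
        {"A": 0.4, "B": 0.3, "C": 0.2, "D": 0.1},
    ],
    "model_b": [
        {"A": 0.6, "B": 0.2, "C": 0.1, "D": 0.1},
        {"A": 0.5, "B": 0.2, "C": 0.2, "D": 0.1},
        {"A": 0.1, "B": 0.7, "C": 0.1, "D": 0.1},
        {"A": 0.1, "B": 0.2, "C": 0.2, "D": 0.5},
    ],
}

# One 0/1 correctness vector per model, aligned with the answer key.
correct_by_model = {
    name: [top1_correct(p, ans) for p, ans in zip(per_question, answer_key)]
    for name, per_question in model_probs.items()
}

correlations = pairwise_accuracy_correlations(correct_by_model)
for pair, r in correlations.items():
    print(pair, round(r, 3))
```

With binary correctness vectors, the Pearson coefficient coincides with the phi coefficient, which is one plausible reading of "positively correlated pairwise"; a rank-based coefficient would be an equally reasonable assumption.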

Taxonomy

Topics: Topic Modeling · Text Readability and Simplification · Artificial Intelligence in Healthcare and Education

Methods: Attention Is All You Need · Position-Wise Feed-Forward Layer · Absolute Position Encodings · Linear Layer · Label Smoothing · Transformer · Dropout