MedExpQA: Multilingual Benchmarking of Large Language Models for Medical Question Answering

Iñigo Alonso; Maite Oronoz; Rodrigo Agerri

arXiv:2404.05590·cs.CL·November 12, 2024·Artif. Intell. Medicine·2 cites

TL;DR

MedExpQA introduces a pioneering multilingual medical question answering benchmark with gold explanations from doctors, revealing significant performance gaps in LLMs, especially in non-English languages, and highlighting challenges in medical knowledge integration.

Contribution

This work presents the first multilingual medical QA benchmark with expert-annotated explanations, enabling comprehensive evaluation of LLMs and RAG methods across languages.

Findings

01 LLMs show large room for improvement in medical QA.

02 Performance drops significantly in non-English languages.

03 RAG methods face challenges in medical knowledge integration.

Abstract

Large Language Models (LLMs) have the potential to facilitate the development of Artificial Intelligence technology that assists medical experts with interactive decision support, as demonstrated by their competitive performance in Medical QA. However, while impressive, the required quality bar for medical applications remains far from being achieved. Currently, LLMs remain challenged by outdated knowledge and by their tendency to generate hallucinated content. Furthermore, most benchmarks for assessing medical knowledge lack reference gold explanations, which means that it is not possible to evaluate the reasoning behind LLM predictions. Finally, the situation is particularly grim if we consider benchmarking LLMs for languages other than English, which remains, as far as we know, a totally neglected topic. In order to address these shortcomings, in this paper we present MedExpQA, the…

Taxonomy

Topics: Topic Modeling · Biomedical Text Mining and Ontologies

Methods: Attention Is All You Need · WordPiece · Weight Decay · Byte Pair Encoding · Linear Layer · Dense Connections · Attention Dropout · Residual Connection · Linear Warmup With Linear Decay