MedExpQA: Multilingual Benchmarking of Large Language Models for Medical Question Answering
Iñigo Alonso, Maite Oronoz, Rodrigo Agerri

TL;DR
MedExpQA introduces a pioneering multilingual medical question answering benchmark with gold explanations from doctors, revealing significant performance gaps in LLMs, especially in non-English languages, and highlighting challenges in medical knowledge integration.
Contribution
This work presents the first multilingual medical QA benchmark with expert-annotated explanations, enabling comprehensive evaluation of LLMs and RAG methods across languages.
Findings
LLMs show large room for improvement in medical QA.
Performance drops significantly in non-English languages.
RAG methods face challenges in medical knowledge integration.
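The cross-lingual gap mentioned above is typically measured as a difference in per-language accuracy on multiple-choice questions. The following is a minimal sketch of that computation, not the authors' evaluation code: the record keys (`lang`, `gold`, `pred`), the language codes, and the toy values are hypothetical and are not results from the paper.

```python
from collections import defaultdict


def accuracy_by_language(records):
    """Return {language: accuracy} for multiple-choice predictions.

    Each record is a dict with hypothetical keys:
      'lang' - language code of the question
      'gold' - index of the correct option
      'pred' - index of the option chosen by the model
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        total[record["lang"]] += 1
        if record["pred"] == record["gold"]:
            correct[record["lang"]] += 1
    return {lang: correct[lang] / total[lang] for lang in total}


# Toy data for illustration only (assumption, not taken from the benchmark).
toy_records = [
    {"lang": "en", "gold": 2, "pred": 2},
    {"lang": "en", "gold": 1, "pred": 1},
    {"lang": "es", "gold": 2, "pred": 3},
    {"lang": "es", "gold": 4, "pred": 4},
]

scores = accuracy_by_language(toy_records)
# Gap between the English score and another language's score.
gap = scores["en"] - scores["es"]
```

Reporting accuracy separately per language, rather than pooling all questions, is what makes a drop in non-English performance visible.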
Abstract
Large Language Models (LLMs) have the potential of facilitating the development of Artificial Intelligence technology to assist medical experts for interactive decision support, which has been demonstrated by their competitive performances in Medical QA. However, while impressive, the required quality bar for medical applications remains far from being achieved. Currently, LLMs remain challenged by outdated knowledge and by their tendency to generate hallucinated content. Furthermore, most benchmarks to assess medical knowledge lack reference gold explanations, which means that it is not possible to evaluate the reasoning behind LLM predictions. Finally, the situation is particularly grim if we consider benchmarking LLMs for languages other than English, which remains, as far as we know, a totally neglected topic. In order to address these shortcomings, in this paper we present MedExpQA, the…
Code & Models
- 🤗 google/medgemma-1.5-4b-it · model · 86k dl · ♡ 536
- 🤗 google/medgemma-4b-it · model · 170k dl · ♡ 925
- 🤗 unsloth/medgemma-27b-it-GGUF · model · 4.4k dl · ♡ 38
- 🤗 google/medgemma-4b-pt · model · 1.1k dl · ♡ 148
- 🤗 google/medgemma-27b-text-it · model · 37k dl · ♡ 412
- 🤗 google/medgemma-27b-it · model · 107k dl · ♡ 330
- 🤗 pszemraj/medgemma-4b-it-heretic · model · 46 dl · ♡ 5
- 🤗 pszemraj/medgemma-27b-text-heretic_med · model · 11 dl · ♡ 5
- 🤗 unsloth/medgemma-1.5-4b-it-GGUF · model · 6.7k dl · ♡ 33
- 🤗 HiTZ/Mistral-7B-MedExpQA-EN · model · 5 dl
Taxonomy
Topics: Topic Modeling · Biomedical Text Mining and Ontologies
Methods: Attention Is All You Need · WordPiece · Weight Decay · Byte Pair Encoding · Linear Layer · Dense Connections · Attention Dropout · Residual Connection · Linear Warmup With Linear Decay
