WorldMedQA-V: a multilingual, multimodal medical examination dataset for multimodal language models evaluation

João Matos; Shan Chen; Siena Placino; Yingya Li; Juan Carlos Climent Pardo; Daphna Idan; Takeshi Tohyama; David Restrepo; Luis F. Nakayama; Jose M. M. Pascual-Leone; Guergana Savova; Hugo Aerts; Leo A. Celi; A. Ian Wong; Danielle S. Bitterman; Jack Gallifant

arXiv:2410.12722·cs.CL·October 17, 2024·2 cites

TL;DR

WorldMedQA-V is a multilingual, multimodal dataset for evaluating vision-language models in healthcare. It pairs multiple-choice medical exam questions with medical images from four countries, in the original languages and in English translation, to support fairer and more effective medical AI.

Contribution

It introduces a novel multilingual, multimodal medical QA dataset with images and translations from four countries, filling gaps in existing text-only benchmarks.

Findings

01 Baseline models show varied performance across languages and modalities.

02 The dataset enables evaluation of models in diverse healthcare settings.

03 It promotes development of more equitable and effective AI in medicine.

Abstract

Multimodal/vision language models (VLMs) are increasingly being deployed in healthcare settings worldwide, necessitating robust benchmarks to ensure their safety, efficacy, and fairness. Multiple-choice question and answer (QA) datasets derived from national medical examinations have long served as valuable evaluation tools, but existing datasets are largely text-only and available in a limited subset of languages and countries. To address these challenges, we present WorldMedQA-V, an updated multilingual, multimodal benchmarking dataset designed to evaluate VLMs in healthcare. WorldMedQA-V includes 568 labeled multiple-choice QAs paired with 568 medical images from four countries (Brazil, Israel, Japan, and Spain), covering original languages and validated English translations by native clinicians, respectively. Baseline performance for common open- and closed-source models are…
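The findings above point to performance differences across languages and modalities. A minimal sketch of how such a breakdown could be computed is shown below. The `Prediction` record, its field names, and the toy rows are hypothetical and chosen only for illustration; they are not the dataset's actual schema, and the numbers are not results from the paper.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Prediction:
    """One model answer to one multiple-choice question (hypothetical schema)."""
    country: str       # e.g. "Brazil", "Israel", "Japan", "Spain"
    language: str      # "original" or "english" (assumed labels)
    with_image: bool   # whether the image was shown to the model
    predicted: str     # option letter returned by the model
    correct: str       # labeled option letter


def accuracy_by_condition(predictions):
    """Return accuracy for each (country, language, with_image) group."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for p in predictions:
        key = (p.country, p.language, p.with_image)
        totals[key] += 1
        # Compare option letters case-insensitively, ignoring stray whitespace.
        hits[key] += int(p.predicted.strip().upper() == p.correct.strip().upper())
    return {key: hits[key] / totals[key] for key in totals}


# Toy rows, made up for illustration only.
toy = [
    Prediction("Brazil", "original", True, "A", "A"),
    Prediction("Brazil", "original", True, "b", "C"),
    Prediction("Brazil", "english", False, "d", "D"),
    Prediction("Japan", "english", True, " C ", "C"),
]
scores = accuracy_by_condition(toy)
```

Grouping on the full (country, language, image) key keeps each evaluation condition separate, so a gap between original-language and English prompts, or between runs with and without images, shows up directly in the resulting dictionary.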

Code & Models

Repositories

WorldMedQA/V (official)

Models

tuandunghcmut/vlmeval (Hugging Face model)

Videos

WorldMedQA-V: a multilingual, multimodal medical examination dataset for multimodal language models evaluation

Taxonomy

Topics: Topic Modeling · Artificial Intelligence in Healthcare and Education