Multi-OphthaLingua: A Multilingual Benchmark for Assessing and Debiasing LLM Ophthalmological QA in LMICs

David Restrepo; Chenwei Wu; Zhengxu Tang; Zitao Shuai; Thao Nguyen; Minh Phan; Jun-En Ding; Cong-Tinh Dao; Jack Gallifant; Robyn Gayle Dychiao; Jose Carlo Artiaga; André Hiroshi Bando; Carolina Pelegrini Barbosa Gracitelli; Vincenz Ferrer; Leo Anthony Celi; Danielle Bitterman; Michael G. Morley; Luis Filipe Nakayama

arXiv:2412.14304·cs.CL·December 20, 2024

TL;DR

This paper introduces a multilingual ophthalmological question-answering benchmark and a novel de-biasing method, CLARA, to improve LLM performance and fairness across languages in medical applications for LMICs.

Contribution

It presents the first multilingual ophthalmological QA benchmark and proposes CLARA, a new de-biasing approach that enhances LLM performance and reduces language bias in medical contexts.

Findings

01 Substantial language bias in LLM performance for ophthalmological QA.

02 Existing de-biasing methods are insufficient for medical multilingual tasks.

03 CLARA significantly improves performance and reduces bias across languages.

Abstract

Current ophthalmology clinical workflows are plagued by over-referrals, long waits, and complex, heterogeneous medical records. Large language models (LLMs) present a promising solution for automating procedures such as triaging, preliminary tests like visual acuity assessment, and report summaries. However, LLMs have demonstrated significantly varied performance across languages in natural language question-answering tasks, potentially exacerbating healthcare disparities in Low- and Middle-Income Countries (LMICs). This study introduces the first multilingual ophthalmological question-answering benchmark with manually curated questions parallel across languages, allowing for direct cross-lingual comparisons. Our evaluation of 6 popular LLMs across 7 different languages reveals substantial bias across different languages, highlighting risks for clinical deployment of…
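
Because the benchmark's questions are parallel across languages, the same question can be scored in every language and the results compared directly. The sketch below is one minimal way to do that, not the paper's evaluation code: the record layout, the function names, and the placeholder language labels (lang_a, lang_b, lang_c) are assumptions made for illustration, and the records are toy data rather than benchmark results.

```python
from collections import defaultdict


def per_language_accuracy(records):
    """Return {language: accuracy} from (question_id, language, is_correct) tuples.

    The tuple layout is a hypothetical format chosen for this sketch.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for _question_id, language, is_correct in records:
        total[language] += 1
        correct[language] += int(is_correct)
    return {language: correct[language] / total[language] for language in total}


def accuracy_gap(accuracy_by_language):
    """Spread between the best- and worst-served language (one simple bias proxy)."""
    values = accuracy_by_language.values()
    return max(values) - min(values)


# Toy records: two parallel questions, each answered in three placeholder languages.
records = [
    ("q1", "lang_a", True), ("q1", "lang_b", True), ("q1", "lang_c", False),
    ("q2", "lang_a", True), ("q2", "lang_b", False), ("q2", "lang_c", False),
]

acc = per_language_accuracy(records)
gap = accuracy_gap(acc)
print(acc, gap)
```

The max-minus-min gap is only one possible summary of cross-lingual disparity; the abstract does not say which metric the authors report.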

Peer Reviews

No public reviews on file for this paper yet.

Code & Models

Datasets

AAAIBenchmark/Multi-Opthalingua
dataset · 33 downloads

Videos

Multi-OphthaLingua: A Multilingual Benchmark for Assessing and Debiasing LLM Ophthalmological QA in LMICs

Taxonomy

Topics: Retinal Imaging and Analysis · Biomedical Text Mining and Ontologies · Acute Ischemic Stroke Management