TL;DR
This paper introduces BanglaMedVQA, a new dataset for Bangla medical visual question answering, and evaluates current models, revealing significant performance gaps in low-resource language medical reasoning.
Contribution
Creation of the BanglaMedVQA dataset and comprehensive benchmarking of foundation models on Bangla medical visual questions.
Findings
Current models perform poorly on Bangla MedVQA, especially on complex diagnostic questions.
Top models like GPT-4.1 mini still fail to accurately answer specialized medical questions.
Open-source models like Gemma-3 sometimes outperform top models in general categories.
Abstract
Recent advancements in Large Language Models (LLMs) and Large Vision Language Models (LVLMs) have enabled general-purpose systems to demonstrate promising capabilities in complex reasoning tasks, including those in the medical domain. Medical Visual Question Answering (MedVQA) has particularly benefited from these developments. However, despite Bangla being one of the most widely spoken languages globally, there exists no established MedVQA benchmark for it. To address this gap, we introduce BanglaMedVQA, a dataset comprising clinically validated image-question-answer pairs, along with a comprehensive evaluation of current foundation models on this resource. Consistent with prior findings that report low performance of current models on English MedVQA benchmarks, our analysis reveals that Bangla performance is substantially lower, reflecting the challenges inherent to low-resource…