PersianMedQA: Evaluating Large Language Models on a Persian-English Bilingual Medical Question Answering Benchmark

Mohammad Javad Ranjbar Kalahroodi; Amirhossein Sheikholselami; Sepehr Karimi; Sepideh Ranjbar Kalahroodi; Heshaam Faili; Azadeh Shakery

arXiv:2506.00250·cs.CL·August 12, 2025

TL;DR

PersianMedQA is a large bilingual Persian-English medical question dataset for evaluating LLMs. Closed-source general models such as GPT-4.1 outperform all other model categories, but domain and language adaptation remain crucial.

Contribution

This work presents PersianMedQA, a dataset of 20,785 expert-validated multiple-choice questions from 14 years of Iranian national medical exams, for assessing LLMs on medical question answering in Persian and English. The results highlight the importance of domain-specific and language adaptation for model performance.

Findings

01

GPT-4.1 achieves over 80% accuracy in both Persian and English.

02

Persian fine-tuned models perform significantly worse than general-purpose models.

03

Translation can cause loss of cultural and clinical context, affecting accuracy.

Abstract

Large Language Models (LLMs) have achieved remarkable performance on a wide range of Natural Language Processing (NLP) benchmarks, often surpassing human-level accuracy. However, their reliability in high-stakes domains such as medicine, particularly in low-resource languages, remains underexplored. In this work, we introduce PersianMedQA, a large-scale dataset of 20,785 expert-validated multiple-choice Persian medical questions from 14 years of Iranian national medical exams, spanning 23 medical specialties and designed to evaluate LLMs in both Persian and English. We benchmark 40 state-of-the-art models, including general-purpose, Persian fine-tuned, and medical LLMs, in zero-shot and chain-of-thought (CoT) settings. Our results show that closed-source general models (e.g., GPT-4.1) consistently outperform all other categories, achieving 83.09% accuracy in Persian and 80.7% in…
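The zero-shot multiple-choice evaluation described in the abstract can be illustrated with a minimal sketch. This is not the authors' evaluation code: the record layout (`question`, `options`, `answer`), the four-option A-D lettering, the prompt wording, and the stub predictor are all assumptions made for illustration, and the toy records are placeholders rather than items from the dataset.

```python
from typing import Callable, Dict, List

# Hypothetical record layout; the real dataset's field names may differ.
Question = Dict[str, object]

LETTERS = "ABCD"  # assumption: four answer options per question


def build_prompt(q: Question) -> str:
    """Format one question as a zero-shot multiple-choice prompt."""
    lines = [str(q["question"])]
    for letter, option in zip(LETTERS, q["options"]):
        lines.append(f"{letter}. {option}")
    lines.append("Answer with a single letter.")
    return "\n".join(lines)


def accuracy(questions: List[Question], predict: Callable[[str], str]) -> float:
    """Return the percentage of questions whose predicted letter matches the key."""
    if not questions:
        return 0.0
    correct = 0
    for q in questions:
        # Keep only the first character of the reply, normalised to upper case.
        predicted = predict(build_prompt(q)).strip().upper()[:1]
        if predicted == q["answer"]:
            correct += 1
    return 100.0 * correct / len(questions)


# Placeholder records, NOT taken from PersianMedQA.
toy_questions: List[Question] = [
    {"question": "Placeholder question 1", "options": ["w", "x", "y", "z"], "answer": "A"},
    {"question": "Placeholder question 2", "options": ["w", "x", "y", "z"], "answer": "B"},
    {"question": "Placeholder question 3", "options": ["w", "x", "y", "z"], "answer": "A"},
    {"question": "Placeholder question 4", "options": ["w", "x", "y", "z"], "answer": "C"},
]


def always_a(prompt: str) -> str:
    """Stub standing in for a model call; always answers 'A'."""
    return "A"


if __name__ == "__main__":
    print(accuracy(toy_questions, always_a))  # prints 50.0
```

In practice `always_a` would be replaced by a function that sends the prompt to a model and returns its reply; running the same loop over the Persian and the English version of each question would give per-language accuracy figures of the kind reported above.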

Code & Models

Datasets

MohammadJRanjbar/PersianMedQA
dataset · 4 dl

Taxonomy

Topics: Topic Modeling · Machine Learning in Healthcare · Artificial Intelligence in Healthcare and Education

Methods: Linear Layer · Dense Connections · Position-Wise Feed-Forward Layer · Absolute Position Encodings · Adam · Softmax · Label Smoothing · Multi-Head Attention · Attention Is All You Need · Dropout