PerMedCQA: Benchmarking Large Language Models on Medical Consumer Question Answering in Persian Language

Naghmeh Jamali; Milad Mohammadi; Danial Baledi; Zahra Rezvani; Hesham Faili

arXiv:2505.18331·cs.CL·May 27, 2025

TL;DR

PerMedCQA introduces the first Persian-language benchmark for evaluating large language models on real-world medical consumer questions, highlighting challenges and guiding future improvements in multilingual medical AI systems.

Contribution

This paper presents PerMedCQA, a new Persian medical QA benchmark, and MedJudge, a novel rubric-based evaluation framework that uses an LLM grader validated against expert human annotators.

Findings

1. Multilingual LLMs face significant challenges in medical QA.
2. MedJudge, the rubric-based LLM grading framework, is validated against expert human annotators.
3. The results offer insights into improving context-awareness in medical AI systems.

Abstract

Medical consumer question answering (CQA) is crucial for empowering patients by providing personalized and reliable health information. Despite recent advances in large language models (LLMs) for medical QA, consumer-oriented and multilingual resources, particularly in low-resource languages like Persian, remain sparse. To bridge this gap, we present PerMedCQA, the first Persian-language benchmark for evaluating LLMs on real-world, consumer-generated medical questions. Curated from a large medical QA forum, PerMedCQA contains 68,138 question-answer pairs, refined through careful data cleaning from an initial set of 87,780 raw entries. We evaluate several state-of-the-art multilingual and instruction-tuned LLMs, utilizing MedJudge, a novel rubric-based evaluation framework driven by an LLM grader, validated against expert human annotators. Our results highlight key challenges in…
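The cleaning figures quoted in the abstract can be checked with a few lines of arithmetic. The sketch below is illustrative only: `cleaning_stats` is a hypothetical helper, not part of the PerMedCQA release, and it uses nothing beyond the two counts stated above.

```python
def cleaning_stats(raw_count: int, kept_count: int) -> dict:
    """Summarize how many entries a cleaning pass kept and removed.

    Hypothetical helper for illustration; not taken from the paper's code.
    """
    if raw_count <= 0 or not 0 <= kept_count <= raw_count:
        raise ValueError("expected raw_count > 0 and 0 <= kept_count <= raw_count")
    removed = raw_count - kept_count
    return {
        "removed": removed,
        "retention_pct": 100.0 * kept_count / raw_count,
        "removed_pct": 100.0 * removed / raw_count,
    }


# Counts quoted in the abstract: 87,780 raw entries, 68,138 after cleaning.
stats = cleaning_stats(raw_count=87_780, kept_count=68_138)
print(
    f"removed {stats['removed']:,} entries "
    f"({stats['removed_pct']:.1f}%), kept {stats['retention_pct']:.1f}%"
)
```

Under these counts, cleaning removed 19,642 entries, so roughly 78% of the raw forum data survives into the benchmark.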

Code & Models

Datasets

NaghmehAI/PerMedCQA
Dataset · 27 downloads

Taxonomy

Topics: Topic Modeling · Sentiment Analysis and Opinion Mining

Methods: Sparse Evolutionary Training