Leveraging Online Data to Enhance Medical Knowledge in a Small Persian Language Model

Mehrdad Ghassabi; Pedram Rostami; Hamidreza Baradaran Kashani; Amirhossein Poursina; Zahra Kazemi; Milad Tavakoli

arXiv:2505.16000 · cs.CL · November 18, 2025

TL;DR

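This paper presents a new Persian medical dataset and a parameter-efficient fine-tuning approach for improving a small language model's medical knowledge. In a resource-limited setting, the fine-tuned model achieves higher accuracy on medical question answering and passes the Iranian Basic Medical Science Entrance Exam.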

Contribution

Introduces the first curated Persian medical dataset and demonstrates effective fine-tuning of a small language model for medical question answering.

Findings

01. Enhanced model accuracy in medical QA tasks

02. Passed the Iranian Basic Medical Science Entrance Exam

03. Improved Persian-translated MMLU accuracy by 2.67%

Abstract

The rapid advancement of language models has demonstrated the potential of artificial intelligence in the healthcare industry. However, small language models struggle with specialized domains in low-resource languages like Persian. While numerous medical-domain websites exist in Persian, no curated dataset or corpus has been available, making ours the first of its kind. This study introduces a newly curated dataset comprising 20k doctor-patient Q&A pairs and 60% of a 90-million-token crawled corpus from medical magazines. Using a parameter-efficient fine-tuning approach, we enhanced the medical knowledge of the baseline model, aya-expanse-8b. Benchmark evaluations demonstrate that the fine-tuned model achieves improved accuracy in medical question answering and successfully passed the Iranian Basic Medical Science Entrance Exam (IBSEE) in September 2023, which the baseline model did…
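The abstract names a parameter-efficient fine-tuning approach without saying which one. As a rough illustration of why such methods suit a resource-limited setting, the sketch below assumes a LoRA-style low-rank adapter (an assumption, not something the abstract confirms) and compares trainable parameter counts for a single weight matrix. The function names, the 4096 x 4096 dimensions, and the rank of 16 are all hypothetical and are not taken from aya-expanse-8b or from the paper.

```python
def dense_param_count(d_in: int, d_out: int) -> int:
    """Parameters updated when fully fine-tuning one d_out x d_in weight matrix."""
    return d_in * d_out


def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in a low-rank adapter: A is rank x d_in, B is d_out x rank."""
    return rank * (d_in + d_out)


def trainable_fraction(d_in: int, d_out: int, rank: int) -> float:
    """Share of the dense matrix's parameter count that the adapter trains."""
    return lora_param_count(d_in, d_out, rank) / dense_param_count(d_in, d_out)


if __name__ == "__main__":
    # Hypothetical layer size and adapter rank, chosen only for illustration.
    d_in = d_out = 4096
    rank = 16
    print("dense parameters:  ", dense_param_count(d_in, d_out))
    print("adapter parameters:", lora_param_count(d_in, d_out, rank))
    print("trainable fraction:", trainable_fraction(d_in, d_out, rank))
```

Under these assumed sizes the adapter trains under 1% of the parameters in the matrix it wraps, which is the general reason low-rank methods fit on modest hardware. The actual savings depend on the method, rank, and layers the authors chose.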

Code & Models

Repositories

mehrdadghassabi/gaokerena (official)

Models

🤗 gaokerena/gaokerena-v1.0 (model)

Taxonomy

Topics: Artificial Intelligence in Healthcare