# Evaluating large language models for clinical note processing: local fine-tuning and internal-external validation using electronic health records from South Asia

**Authors:** Seyed Alireza Hasheminasab, Faisal Jamil, Muhammad Usman Afzal, Ali Haider Khan, Sehrish Ilyas, Ali Noor, Awais Touseef, Salma Abbas, Hajira Nisar Cheema, Muhammad Usman Shabbir, Iqra Hameed, Maleeha Ayub, Hamayal Masood, Amina Jafar, Amir Mukhtar Khan, Muhammad Abid Nazir, Muhammad Asaad Jamil, Faisal Sultan, Sara Khalid

PMC · DOI: 10.1186/s12911-026-03366-8 · BMC Medical Informatics and Decision Making · 2026-02-25

## TL;DR

This study evaluates how reliably publicly available large language models process clinical notes from a large electronic health records database in Pakistan, and finds that fine-tuning on local data significantly improves their performance.

## Contribution

The study demonstrates that fine-tuning large language models with local electronic health records improves performance in clinical tasks in resource-limited settings.

## Key findings

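- Medical LLMs that were not fine-tuned on local data performed notably worse on it: external validation showed reductions of at least 15% in F1 score for medical concept extraction (MCE) and 35% in ROUGE-L and BLEU scores for medical question answering (MQA), compared with internal validation.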
- Fine-tuning with local EHR data increased the F1 score by 7.5% to 15% for MCE and the ROUGE-L and BLEU scores by 27% to 53% for MQA.
- ChatGPT (GPT-3.5) was the exception: even without local fine-tuning, it scored 3% to 17% higher across all measured metrics on the local dataset than on the publicly available dataset.

## Abstract

Large Language Models (LLMs) hold the potential for clinical task-shifting by processing unstructured clinical text, enabling tasks such as clinical concept extraction and medical question answering from electronic health records. If implemented reliably, such approaches may benefit over-burdened healthcare systems, particularly in resource-limited settings and for traditionally overlooked populations, provided that local fine-tuning is supported by appropriate clinical and technical expertise. However, this powerful technology remains largely understudied in real-world contexts, particularly in the Global South. This study aims to assess whether openly available LLMs can be used reliably for processing medical notes in real-world settings in South Asia.

We used publicly available LLMs to parse de-identified clinical notes from a large electronic health records (EHR) database in Pakistan, containing hospital records for 8.2 million patients. ChatGPT (GPT-3.5), as a general-purpose LLM, and GatorTron (base), BioMegatron, BioBERT, and ClinicalBERT, as medical LLMs, were evaluated on these data after fine-tuning with (a) publicly available clinical datasets, namely Informatics for Integrating Biology & the Bedside (I2B2) and National NLP Clinical Challenges (N2C2) for medical concept extraction (MCE) and emrQA for medical question answering (MQA), and (b) the local Pakistani de-identified EHR dataset, which includes inpatient Discharge Summaries (DS) and Subjective, Objective, Assessment, and Plan (SOAP) notes, as detailed in this paper. MCE models were applied to these clinical notes using both 3-label and 9-label formats, while MQA models were applied to medical questions. Internal and external validation performance was measured for (a) and (b) using F1 score, precision, recall, and accuracy for MCE, and BLEU and ROUGE-L, which measure lexical and sequence similarity, for MQA.
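To make the evaluation metrics concrete, the following is a minimal sketch of per-label precision, recall, and F1 for a token-labelling task such as MCE, and of a ROUGE-L F1 score based on the longest common subsequence for a question-answering task such as MQA. It is not the authors' evaluation code: the function names, the whitespace tokenisation, the use of an unweighted F1 for ROUGE-L, and the example labels and sentences are all assumptions made for illustration.

```python
from typing import Sequence, Tuple


def label_prf(gold: Sequence[str], pred: Sequence[str], label: str) -> Tuple[float, float, float]:
    """Token-level precision, recall, and F1 for one label (hypothetical helper)."""
    pairs = list(zip(gold, pred))
    tp = sum(1 for g, p in pairs if g == label and p == label)
    fp = sum(1 for g, p in pairs if g != label and p == label)
    fn = sum(1 for g, p in pairs if g == label and p != label)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference: str, candidate: str) -> float:
    """ROUGE-L as an unweighted F1 over whitespace tokens (a simplifying assumption)."""
    ref, cand = reference.lower().split(), candidate.lower().split()
    if not ref or not cand:
        return 0.0
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


# Illustrative inputs only; these are not drawn from the study's data.
gold_labels = ["problem", "O", "treatment", "O"]
pred_labels = ["problem", "problem", "O", "O"]
print(label_prf(gold_labels, pred_labels, "problem"))
print(rouge_l_f1("patient was started on metformin", "patient started metformin"))
```

Published evaluations typically rely on established metric implementations and may score MCE at the entity level instead of the token level, so values from this sketch are not directly comparable with the paper's reported numbers.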

When clinical LLMs were not fine-tuned on the local EHR dataset, their performance during external validation on local data was notably poorer compared to internal validation on the dataset used for fine-tuning, with reductions of at least 15% in F1 scores for MCE and 35% in ROUGE-L and BLEU scores for MQA tasks. This suggests potential bias and highlights the inability of the medical LLMs to reliably handle the data distribution of the local population without further fine-tuning and adaptation. This trend persisted across two distinct natural language processing tasks: concept extraction and question answering, spanning a spectrum of task complexities. However, fine-tuning the LLMs with local EHR data significantly improved model performance across both tasks, yielding a 7.5% to 15% increase in the F1 score for MCE and a 27% to 53% increase in ROUGE-L and BLEU scores for MQA. Notably, ChatGPT, as a general-purpose LLM, stood out as an exception, demonstrating superior performance across all measured metrics on the local dataset compared to the publicly available dataset, with improvements ranging from 3% to 17% on the local EHR dataset, even without fine-tuning on the local data.
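The internal-external comparison described above can be sketched as a small helper that reports the gap between two validation scores. The abstract does not state whether its percentages are absolute differences or relative changes, so this sketch returns both; the function name and the example scores are hypothetical and are not results from the study.

```python
from typing import Dict


def validation_gap(internal: float, external: float) -> Dict[str, float]:
    """Compare an internal-validation score with an external-validation score.

    Both scores are assumed to be on the same 0-1 scale (e.g. F1 or ROUGE-L).
    """
    absolute_drop = internal - external
    # Relative drop is undefined when the internal score is zero.
    relative_drop = absolute_drop / internal if internal else float("nan")
    return {"absolute_drop": absolute_drop, "relative_drop": relative_drop}


# Hypothetical scores chosen only to show the arithmetic.
gap = validation_gap(internal=0.80, external=0.60)
print(f"absolute drop: {gap['absolute_drop']:.2f}")
print(f"relative drop: {gap['relative_drop']:.2%}")
```

Reporting both quantities avoids ambiguity when the same word "reduction" is used for scores that start from very different baselines.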

Publicly available LLMs, predominantly trained on data from high-income regions, were found to be unreliable when applied in a real-world clinical setting in Pakistan. Fine-tuning them with local EHR data and regional clinical contexts improved their reliability, demonstrating a feasible adaptation strategy that is substantially less resource-intensive than training large language models from scratch. Close collaboration between local clinical and technical experts to curate and leverage more representative, inclusive, and unbiased medical datasets can play a crucial role in further ensuring the reliability of LLMs in resource-limited, overburdened settings, so that they are used in ways that are safe, fair, and beneficial for all.

The online version contains supplementary material available at 10.1186/s12911-026-03366-8.

## Full-text entities

- **Diseases:** SOAP (MESH:D014717), LLMs (MESH:D007806), Cancer (MESH:D009369)
- **Chemicals:** MCE (-)
- **Species:** Homo sapiens (human, species) [taxon 9606]
- **Cell lines:** SKMCH&RC: Homo sapiens (Human), High grade B-cell lymphoma with MYC and BCL2 or BCL6 rearrangements, Cancer cell line (CVCL_9U45)

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12988631/full.md

## Figures

6 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12988631/full.md

## References

1 reference, listed in the complete paper: https://tomesphere.com/paper/PMC12988631/full.md

---
Source: https://tomesphere.com/paper/PMC12988631