# Evaluating the Performance of AI Large Language Models in Detecting Pediatric Medication Errors Across Languages: A Comparative Study

**Authors:** Rana K. Abu-Farha, Haneen Abuzaid, Jena Alalawneh, Muna Sharaf, Redab Al-Ghawanmeh, Eyad A. Qunaibi

PMC · DOI: 10.3390/jcm15010162 · Journal of Clinical Medicine · 2025-12-25

## TL;DR

This study compared the ability of four AI models (GPT-5, GPT-4, Microsoft Copilot, and Google Gemini) to detect medication errors in pediatric case scenarios presented in English and Arabic, finding that Microsoft Copilot performed best overall.

## Contribution

The study evaluates AI models' performance in multilingual pediatric medication error detection, emphasizing the need for improved multilingual training.

## Key findings

- Microsoft Copilot had the highest accuracy in both English (86.7%) and Arabic (85.0%) for detecting medication errors.
- Gemini showed the lowest accuracy and reproducibility across both languages.
- Performance in Arabic was generally lower than in English for most models.

## Abstract

Objectives: This study aimed to evaluate the performance of four AI models (GPT-5, GPT-4, Microsoft Copilot, and Google Gemini) in detecting medication errors in pediatric case scenarios. Methods: A total of 60 pediatric cases were analyzed for the presence of medication errors; half of the cases contained errors. The cases covered four therapeutic areas (respiratory, endocrine, neurology, and infectious diseases). The four models were presented with the cases in both English and Arabic using a unified prompt. Each model's responses were used to calculate performance metrics, including accuracy, sensitivity, specificity, and reproducibility. Analysis was carried out using SPSS version 22. Results: Microsoft Copilot demonstrated relatively higher accuracy (86.7% in English, 85.0% in Arabic) than the other models in this dataset, followed by GPT-5 (81.7% in English, 75.0% in Arabic). GPT-4 and Google Gemini were less accurate, with Gemini having the lowest accuracy in both languages (76.7% in English and 73.3% in Arabic). Microsoft Copilot showed comparatively higher sensitivity and specificity, particularly in respiratory and infectious disease cases. Accuracy in Arabic was lower than in English for the majority of models. Microsoft Copilot exhibited relatively higher reproducibility and inter-run agreement (Cohen’s Kappa = 0.836 in English, 0.815 in Arabic; p < 0.001 for both), while Gemini showed the lowest reproducibility. Copilot also showed the highest inter-language agreement between English and Arabic (Cohen’s Kappa = 0.701, p < 0.001). Conclusions: In our evaluation, Microsoft Copilot demonstrated relatively higher performance in pediatric medication error detection than the other AI models. The lower performance in Arabic points to the need for improved multilingual training so that AI assistance is equally reliable across languages. This study highlights the importance of human oversight and domain-specific training for AI tools in pediatric pharmacotherapy.
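The metrics named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' analysis, which was run in SPSS; the function names `detection_metrics` and `cohen_kappa` are hypothetical, and the example labels are made up. Only the design of 60 cases, half of them containing an error, comes from the abstract.

```python
from typing import Dict, Sequence


def detection_metrics(truth: Sequence[bool], pred: Sequence[bool]) -> Dict[str, float]:
    """Accuracy, sensitivity, and specificity for binary error detection.

    truth[i] is True if case i really contains a medication error;
    pred[i] is True if the model flagged case i as containing an error.
    Assumes at least one positive and one negative case in `truth`.
    """
    if len(truth) != len(pred):
        raise ValueError("truth and pred must have the same length")
    tp = sum(1 for t, p in zip(truth, pred) if t and p)          # errors caught
    tn = sum(1 for t, p in zip(truth, pred) if not t and not p)  # clean cases passed
    fp = sum(1 for t, p in zip(truth, pred) if not t and p)      # false alarms
    fn = sum(1 for t, p in zip(truth, pred) if t and not p)      # missed errors
    return {
        "accuracy": (tp + tn) / len(truth),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


def cohen_kappa(a: Sequence[bool], b: Sequence[bool]) -> float:
    """Cohen's kappa for two sets of binary ratings of the same cases
    (for example two runs of one model, or its English and Arabic outputs).
    Undefined (division by zero) when expected agreement equals 1.
    """
    if len(a) != len(b):
        raise ValueError("a and b must have the same length")
    n = len(a)
    observed = sum(1 for x, y in zip(a, b) if x == y) / n
    pa, pb = sum(a) / n, sum(b) / n             # share of "error" ratings per rater
    expected = pa * pb + (1 - pa) * (1 - pb)    # agreement expected by chance
    return (observed - expected) / (1 - expected)


# Illustrative data only: 30 cases with an error, 30 without (as in the study
# design). The split of correct and incorrect answers below is an assumption.
truth = [True] * 30 + [False] * 30
pred = [True] * 27 + [False] * 3 + [False] * 25 + [True] * 5
metrics = detection_metrics(truth, pred)

# Two hypothetical runs that disagree on 6 of the 60 cases.
run_1 = [True] * 30 + [False] * 30
run_2 = [True] * 27 + [False] * 3 + [False] * 27 + [True] * 3
kappa = cohen_kappa(run_1, run_2)
```

With 60 cases, each reported accuracy corresponds to a whole number of correct answers: 86.7% is 52 of 60, which is the total the made-up `pred` list above produces. The abstract does not say how those correct answers split between detected errors and correctly cleared cases, so the sensitivity and specificity in this sketch are not the paper's values.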

## Full-text entities

- **Diseases:** respiratory and infectious diseases (MESH:D012141)
- **Species:** Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12786879/full.md

## Figures

2 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12786879/full.md

## References

31 references — full list in the complete paper: https://tomesphere.com/paper/PMC12786879/full.md

---
Source: https://tomesphere.com/paper/PMC12786879